100+ datasets found
  1. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Nov 27, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (167219625372 bytes)
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we're releasing this dataset

    By collecting all of this code created by Kaggle's community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code's author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
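    The join described above can be sketched in pandas. This is a minimal sketch: the column name `Id` and the file-name convention follow the description, but the sample rows are invented for illustration.

```python
import pandas as pd

# Invented sample of Meta Kaggle's KernelVersions.csv (only two columns shown).
kernel_versions = pd.DataFrame({
    "Id": [123456789, 123456790],
    "Title": ["eda-notebook", "submission"],
})

# Meta Kaggle Code file names are the KernelVersions ids plus an extension,
# so stripping the extension recovers the join key.
code_files = pd.DataFrame({"FileName": ["123456789.py", "123456790.ipynb"]})
code_files["Id"] = code_files["FileName"].str.split(".").str[0].astype("int64")

# Left join: every code file picks up its kernel-version metadata.
joined = code_files.merge(kernel_versions, on="Id", how="left")
```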

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
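    Under this layout, the folder holding a given version id can be computed directly. A small sketch (whether folder names are zero-padded is not specified above, so none is assumed):

```python
def meta_kaggle_code_folder(version_id: int) -> str:
    """Map a KernelVersions id to its two-level folder path.

    Per the layout above: folder 123 holds ids 123,000,000-123,999,999,
    and 123/456 holds ids 123,456,000-123,456,999.
    """
    top = version_id // 1_000_000        # up to 1 million files per top folder
    sub = (version_id // 1_000) % 1_000  # up to 1 thousand files per subfolder
    return f"{top}/{sub}"
```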

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  2. Data from: KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle

    • zenodo.org
    • dataon.kisti.re.kr
    • +1more
    bin, bz2, pdf
    Updated Jul 19, 2024
    Cite
    Luigi Quaranta; Fabio Calefato; Filippo Lanubile (2024). KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle [Dataset]. http://doi.org/10.5281/zenodo.4468523
    Explore at:
    bz2, pdf, bin
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Luigi Quaranta; Fabio Calefato; Filippo Lanubile
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    KGTorrent is a dataset of Python Jupyter notebooks from the Kaggle platform.

    The dataset is accompanied by a MySQL database containing metadata about the notebooks and the activity of Kaggle users on the platform. The information to build the MySQL database has been derived from Meta Kaggle, a publicly available dataset containing Kaggle metadata.

    In this package, we share the complete KGTorrent dataset (consisting of the dataset itself plus its companion database), as well as the specific version of Meta Kaggle used to build the database.

    More specifically, the package comprises the following three compressed archives:

    1. KGT_dataset.tar.bz2, the dataset of Jupyter notebooks;

    2. KGTorrent_dump_10-2020.sql.tar.bz2, the dump of the MySQL companion database;

    3. MetaKaggle27Oct2020.tar.bz2, a copy of the Meta Kaggle version used to build the database.

    Moreover, we include KGTorrent_logical_schema.pdf, the logical schema of the KGTorrent MySQL database.
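    The three .tar.bz2 archives can be unpacked with Python's standard tarfile module. A generic sketch, not specific to the archives' internal layout:

```python
import tarfile


def extract_archive(archive: str, dest: str = ".") -> list:
    """Extract a .tar.bz2 archive (e.g. KGT_dataset.tar.bz2) into dest.

    Returns the list of member names so you can inspect what was unpacked.
    """
    with tarfile.open(archive, mode="r:bz2") as tar:
        tar.extractall(path=dest)
        return tar.getnames()
```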

  3. Top 1000 Kaggle Datasets

    • kaggle.com
    zip
    Updated Jan 3, 2022
    Cite
    Trrishan (2022). Top 1000 Kaggle Datasets [Dataset]. https://www.kaggle.com/datasets/notkrishna/top-1000-kaggle-datasets
    Explore at:
    zip (34269 bytes)
    Dataset updated
    Jan 3, 2022
    Authors
    Trrishan
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    From Wikipedia:

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]

    Source: Kaggle

  4. test-dataset-kaggle

    • huggingface.co
    Updated Feb 15, 2024
    Cite
    Gholamreza Dar (2024). test-dataset-kaggle [Dataset]. https://huggingface.co/datasets/Gholamreza/test-dataset-kaggle
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 15, 2024
    Authors
    Gholamreza Dar
    Description

    Gholamreza/test-dataset-kaggle dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Student Performance Dataset

    • kaggle.com
    Updated Aug 27, 2025
    Cite
    Ghulam Muhammad Nabeel (2025). Student Performance Dataset [Dataset]. https://www.kaggle.com/datasets/nabeelqureshitiii/student-performance-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 27, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ghulam Muhammad Nabeel
    Description

    Student Performance Dataset (Synthetic, Realistic)

    Overview

    This dataset contains 1,000,000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.

    Each row represents one student, with features like study hours, attendance, class participation, and final score.
    The dataset is clean and structured to be beginner-friendly.

    Columns Description

    • student_id → Unique identifier for each student.
    • weekly_self_study_hours → Average weekly self-study hours (0–40). Generated using a normal distribution centered around 15 hours.
    • attendance_percentage → Attendance percentage (50–100). Simulated with a normal distribution around 85%.
    • class_participation → Score between 0–10 indicating how actively the student participates in class. Generated from a normal distribution centered around 6.
    • total_score → Final performance score (0–100). Calculated as a function of study hours + random noise, then clipped between 0–100. Stronger correlation with study hours.
    • grade → Categorical label (A, B, C, D, F) derived from total_score.

    Data Generation Logic

    1. Weekly Study Hours: Modeled using a normal distribution (mean ≈ 15, std ≈ 7), capped between 0 and 40 hours.
    2. Scores: More study hours → higher score. Formula:

    Random noise simulates differences in learning ability, motivation, etc.

    3. Attendance & Participation: Independent but realistic variations added.
    4. Grades: Assigned from scores using thresholds:
    • A: ≥ 85
    • B: ≥ 70
    • C: ≥ 55
    • D: ≥ 40
    • F: < 40
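    The generation logic above can be sketched with NumPy. The exact score formula is elided in the description, so the linear coefficient and noise level below are assumptions; the distributions, clipping ranges, and grade thresholds follow the text.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

# Features per the described distributions, clipped to their stated ranges.
study = np.clip(rng.normal(15, 7, n), 0, 40)          # weekly_self_study_hours
attendance = np.clip(rng.normal(85, 8, n), 50, 100)   # attendance_percentage
participation = np.clip(rng.normal(6, 2, n), 0, 10)   # class_participation

# Assumed linear dependence on study hours plus Gaussian noise, clipped 0-100.
score = np.clip(40 + 2.0 * study + rng.normal(0, 8, n), 0, 100)

def grade(s: float) -> str:
    """Grade thresholds exactly as listed: A >= 85, B >= 70, C >= 55, D >= 40."""
    for g, cut in (("A", 85), ("B", 70), ("C", 55), ("D", 40)):
        if s >= cut:
            return g
    return "F"

grades = np.array([grade(s) for s in score])
```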

    How to Use This Dataset

    Regression Tasks

    • Predict total_score from weekly_self_study_hours.
    • Train and evaluate Linear Regression models.
    • Extend to multiple regression using attendance_percentage and class_participation.

    Classification Tasks

    • Predict grade (A–F) using study hours, attendance, and participation.

    Model Evaluation Practice

    • Apply train-test split and cross-validation.
    • Evaluate with MAE, RMSE, R².
    • Compare simple vs. multiple regression.
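    As an illustration of the simple-regression task, a least-squares fit and the listed metrics (MAE, RMSE, R²) can be computed with NumPy alone. This uses synthetic stand-in data rather than the real CSV, so the numbers themselves are not meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)
hours = np.clip(rng.normal(15, 7, 500), 0, 40)
score = np.clip(40 + 2.0 * hours + rng.normal(0, 8, 500), 0, 100)  # stand-in target

# Closed-form least squares with design matrix [1, hours].
X = np.column_stack([np.ones_like(hours), hours])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
pred = X @ coef

mae = np.mean(np.abs(score - pred))
rmse = np.sqrt(np.mean((score - pred) ** 2))
r2 = 1 - np.sum((score - pred) ** 2) / np.sum((score - score.mean()) ** 2)
```

    Extending to multiple regression is just adding attendance and participation columns to `X`.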

    This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).

  6. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    Open Database License (ODbL): https://choosealicense.com/licenses/odbl/

    Description

    Date: 2022-07-10
    Files: ner_dataset.csv
    Source: Kaggle entity annotated corpus
    Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.

    About Dataset

    From Kaggle Datasets

    Context

    Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features by Natural Language Processing applied to the data set. Tip: Use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.

  7. Kaggle Dataset Metadata Repository

    • kaggle.com
    zip
    Updated Nov 16, 2024
    Cite
    Ijaj Ahmed (2024). Kaggle Dataset Metadata Repository [Dataset]. https://www.kaggle.com/datasets/ijajdatanerd/kaggle-dataset-metadata-repository
    Explore at:
    zip (5122110 bytes)
    Dataset updated
    Nov 16, 2024
    Authors
    Ijaj Ahmed
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description


    Kaggle Dataset Metadata Collection

    This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle.

    Dataset Overview:

    • Purpose: To provide detailed insights into Kaggle dataset metadata.
    • Content: Information related to the dataset's owner, creator, usage metrics, licensing, and more.
    • Target Audience: Data scientists, Kaggle competitors, and dataset curators.

    Columns Description

    • datasetUrl: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.

    • ownerAvatarUrl: The URL of the dataset owner's profile avatar on Kaggle.

    • ownerName: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.

    • ownerUrl: A link to the Kaggle profile page of the dataset owner.

    • ownerUserId: The unique user ID of the dataset owner on Kaggle.

    • ownerTier: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.

    • creatorName: The name of the dataset creator, which could be different from the owner.

    • creatorUrl: A link to the Kaggle profile page of the dataset creator.

    • creatorUserId: The unique user ID of the dataset creator.

    • scriptCount: The number of scripts (kernels) associated with this dataset.

    • scriptsUrl: A link to the scripts (kernels) page for the dataset, where you can explore related code.

    • forumUrl: The URL to the discussion forum for this dataset, where users can ask questions and share insights.

    • viewCount: The number of views the dataset page has received on Kaggle.

    • downloadCount: The number of times the dataset has been downloaded by users.

    • dateCreated: The date when the dataset was first created and uploaded to Kaggle.

    • dateUpdated: The date when the dataset was last updated or modified.

    • voteButton: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.

    • categories: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").

    • licenseName: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").

    • licenseShortName: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).

    • datasetSize: The size of the dataset in terms of storage, typically measured in MB or GB.

    • commonFileTypes: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).

    • downloadUrl: A direct link to download the dataset files.

    • newKernelNotebookUrl: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.

    • newKernelScriptUrl: A link to a new script for running computations or processing data related to the dataset.

    • usabilityRating: A rating or score representing how usable the dataset is, based on user feedback.

    • firestorePath: A reference to the path in Firestore where this dataset's metadata is stored.

    • datasetSlug: A URL-friendly version of the dataset name, typically used for URLs.

    • rank: The dataset's rank based on certain metrics (e.g., downloads, votes, views).

    • datasource: The source or origin of the dataset (e.g., government data, private organizations).

    • medalUrl: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.

    • hasHashLink: Indicates whether the dataset has a hash link for verifying data integrity.

    • ownerOrganizationId: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.

    • totalVotes: The total number of votes the dataset has received from users, reflecting its popularity or quality.

    • category_names: A comma-separated string of category names that represent the dataset's classification.

    This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way.

  8. Kaggle Annotation Dataset

    • universe.roboflow.com
    zip
    Updated Apr 2, 2025
    Cite
    kaggle annotate (2025). Kaggle Annotation Dataset [Dataset]. https://universe.roboflow.com/kaggle-annotate/kaggle-annotation
    Explore at:
    zip
    Dataset updated
    Apr 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    kaggle annotate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Objects Bounding Boxes
    Description

    Kaggle Annotation

    ## Overview

    Kaggle Annotation is a dataset for object detection tasks - it contains Objects annotations for 965 images.

    ## Getting Started

    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  9. 60k-data-with-context-v2

    • kaggle.com
    Updated Sep 2, 2023
    Cite
    Chris Deotte (2023). 60k-data-with-context-v2 [Dataset]. https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Chris Deotte
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset can be used to train an Open Book model for Kaggle's LLM Science Exam competition. This dataset was generated by searching and concatenating all publicly shared datasets on Sept 1 2023.

    The context column was generated using Mgoksu's notebook here, with NUM_TITLES=5 and NUM_SENTENCES=20.

    The source column indicates where the dataset originated. Below are the sources:

    source = 1 & 2: Radek's 6.5k dataset. Discussion here and here; dataset here.

    source = 3 & 4: Radek's 15k + 5.9k. Discussion here and here; dataset here.

    source = 5 & 6: Radek's 6k + 6k. Discussion here and here; dataset here.

    source = 7: Leonid's 1k. Discussion here; dataset here.

    source = 8: Gigkpeaeums' 3k. Discussion here; dataset here.

    source = 9: Anil's 3.4k. Discussion here; dataset here.

    source = 10, 11, 12: Mgoksu's 13k. Discussion here; dataset here.

  10. Mental Health & Technology Usage Dataset

    • kaggle.com
    zip
    Updated Sep 5, 2024
    Cite
    Waqar Ali (2024). Mental Health & Technology Usage Dataset [Dataset]. https://www.kaggle.com/datasets/waqi786/mental-health-and-technology-usage-dataset
    Explore at:
    zip (201342 bytes)
    Dataset updated
    Sep 5, 2024
    Authors
    Waqar Ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset offers insights into how daily technology usage, including social media and screen time, impacts mental health. It captures various behavioral patterns and their correlations with mental health indicators like stress levels, sleep quality, and productivity. Dive in to analyze the relationship between our digital lives and mental wellness!

    The data is useful for research, academic projects, or building predictive models to understand trends in mental health influenced by screen time and technology habits.

  11. Learning Resources Database

    • kaggle.com
    • datadiscovery.nlm.nih.gov
    • +3more
    zip
    Updated Nov 5, 2023
    + more versions
    Cite
    Prasad Patil (2023). Learning Resources Database [Dataset]. https://www.kaggle.com/datasets/prasad22/learning-resources-database
    Explore at:
    zip (82916 bytes)
    Dataset updated
    Nov 5, 2023
    Authors
    Prasad Patil
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Learning Resources Database is a catalog of interactive tutorials, videos, online classes, finding aids, and other instructional resources on National Library of Medicine (NLM) products and services. Resources may be available for immediate use via a browser or downloadable for use in course management systems.

    Dataset Description

    It contains 520 rows and 13 variables, as listed below:
    • Resource ID: Alphanumeric identifier
    • Resource Name: Title of the resource
    • Resource URL: Link to the resource
    • Description: Brief explanation of the resource
    • Archived: Flagged as False for all data points
    • Format: Format of the resource, e.g. HTML, PDF, MP4 video, MS Word, PowerPoint, etc.
    • Type: Type of the resource, e.g. webinar, document, tutorial, slides, etc.
    • Runtime: Runtime of the resource
    • Subject Areas: Topics covered in the resource
    • Authoring Organization: Name of the authoring organization
    • Intended Audiences: Profile of the intended audience
    • Record Modified: Timestamp of the record's last modification
    • Resource Revised: Timestamp of the resource's last modification

  12. Kaggle Top Datasets

    • kaggle.com
    zip
    Updated Apr 10, 2024
    Cite
    Aaron Frias (2024). Kaggle Top Datasets [Dataset]. https://www.kaggle.com/datasets/aaronfriasr/kaggle-top-datasets
    Explore at:
    zip (1572305 bytes)
    Dataset updated
    Apr 10, 2024
    Authors
    Aaron Frias
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Kaggle is one of the largest communities of data scientists and machine learning practitioners in the world, and its platform hosts thousands of datasets covering a wide range of topics and industries. With so many options to choose from, it can be difficult to know where to start or what datasets are worth exploring. That's where this dataset comes in. By scraping information about the top 10,000 datasets on Kaggle, we have created a single source of truth for the most popular and useful datasets on the platform. This dataset is not just a list of names and numbers, but a valuable tool for data enthusiasts and professionals alike, providing insights into the latest trends and techniques in data science and machine learning.

    Column description:
    • Dataset_name - Name of the dataset
    • Author_name - Name of the author
    • Author_id - Kaggle id of the author
    • No_of_files - Number of files the author has uploaded
    • size - Size of all the files
    • Type_of_file - Type of the files, such as csv, json, etc.
    • Upvotes - Total upvotes of the dataset
    • Medals - Medal of the dataset
    • Usability - Usability of the dataset
    • Date - Date on which the dataset was uploaded
    • Day - Day on which the dataset was uploaded
    • Time - Time at which the dataset was uploaded
    • Dataset_link - Kaggle link of the dataset

    Acknowledgements: The data has been scraped from the official Kaggle website and is available under a Creative Commons license.

    Enjoy & Keep Learning !!!

  13. (Sunset) Meta Kaggle ported to MS SQL SERVER

    • kaggle.com
    zip
    Updated Mar 20, 2024
    Cite
    BwandoWando (2024). (Sunset) Meta Kaggle ported to MS SQL SERVER [Dataset]. https://www.kaggle.com/datasets/bwandowando/meta-kaggle-ported-to-sql-server-2022-database
    Explore at:
    zip (8635902534 bytes)
    Dataset updated
    Mar 20, 2024
    Authors
    BwandoWando
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I've always wanted to explore Kaggle's Meta Kaggle dataset, but I am more comfortable using T-SQL when it comes to writing (very) complex queries. I also tend to write queries faster when using SQL Server Management Studio, like 100x faster. So, I ported Kaggle's Meta Kaggle dataset into MS SQL Server 2022 database format, created a backup file, then uploaded it here.

    • MSSQL VERSION: SQL Server 2022
    • Collation: SQL_Latin1_General_CP1_CI_AS
    • Recovery model: simple

    Requirements

    • Download and install the SQL SERVER 2022 Developer edition here
    • Download the backup file
    • Restore the backup file into your local instance. If you haven't done this before, it's easy and straightforward. Here is a guide.

    (QUOTED FROM THE ORIGINAL DATASET)

    Meta Kaggle

    Explore Kaggle's public data on competitions, datasets, kernels (code/notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but they think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.



  14. Social Media and Mental Health

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    SouvikAhmed071 (2023). Social Media and Mental Health [Dataset]. https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health
    Explore at:
    zip (10944 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    SouvikAhmed071
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.

    The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.

    This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.

    The following is the Google Colab link to the project, done on Jupyter Notebook -

    https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN

    The following is the GitHub Repository of the project -

    https://github.com/daerkns/social-media-and-mental-health

    Libraries used for the Project -

    Pandas
    Numpy
    Matplotlib
    Seaborn
    scikit-learn
    
  15. Kaggle: Forum Discussions

    • kaggle.com
    zip
    Updated Nov 8, 2025
    Cite
    Nicolás Ariel González Muñoz (2025). Kaggle: Forum Discussions [Dataset]. https://www.kaggle.com/datasets/nicolasgonzalezmunoz/kaggle-forum-discussions
    Explore at:
    zip (542099 bytes)
    Dataset updated
    Nov 8, 2025
    Authors
    Nicolás Ariel González Muñoz
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Note: This is a work in progress, and not all Kaggle forums are included in this dataset yet. The remaining forums will be added once I finish resolving some issues with the data generators for those forums.

    Summary

    Welcome to the Kaggle Forum Discussions dataset! This dataset contains curated data about recent discussions opened in the different forums on Kaggle. The data is obtained through web scraping, using the selenium library, and converting text data into Markdown using the markdownify package.

    This dataset contains information about the discussion main topic, topic title, comments, votes, medals and more, and is designed to serve as a complement to the data available on the Kaggle meta dataset, specifically for recent discussions. Keep reading to see the details.

    Extraction Technique

    Because Kaggle is a dynamic website that relies heavily on JavaScript (JS), I extracted the data in this dataset through web scraping techniques using the selenium library.

    The functions and classes used to scrape the data on Kaggle were stored in a utility script publicly available here. As JS-generated pages like Kaggle's are unstable when you try to scrape them, the script implements capabilities for retrying connections and for waiting for elements to appear.

    Each forum was scraped using its own notebook; those notebooks were then connected to a central notebook that generates this dataset. The discussions are also scraped in parallel to improve speed. This dataset represents all the data that can be gathered in a single notebook session, from the most recent to the oldest.

    If you need more control on the data you want to research, feel free to import all you need from the utility script mentioned before.

    Structure

    This dataset contains several folders, each named after the discussion forum it covers. For example, the 'competition-hosting' folder contains data about the Competition Hosting forum. Inside each folder you'll find two files: a csv file and a json file.

    The json file (in Python, represented as a dictionary) is indexed by the ID that Kaggle assigns to each discussion. Each ID is paired with its corresponding discussion, represented as a nested dictionary (the discussion dict) with the following fields:

    - title: The title of the main topic.
    - content: Content of the main topic.
    - tags: List containing the discussion's tags.
    - datetime: Date and time at which the discussion was published (ISO 8601 format).
    - votes: Number of votes received by the discussion.
    - medal: Medal awarded to the main topic (if any).
    - user: User who published the main topic.
    - expertise: Publisher's expertise, per the Kaggle progression system.
    - n_comments: Total number of comments in the discussion.
    - n_appreciation_comments: Total number of appreciation comments in the discussion.
    - comments: Dictionary of the comments in the discussion. Each comment is indexed by an ID assigned by Kaggle and contains the following fields:
      - content: Comment's content.
      - is_appreciation: Whether the comment is an appreciation comment.
      - is_deleted: Whether the comment was deleted.
      - n_replies: Number of replies to the comment.
      - datetime: Date and time at which the comment was published (ISO 8601 format).
      - votes: Number of votes received by the comment.
      - medal: Medal awarded to the comment (if any).
      - user: User who published the comment.
      - expertise: Publisher's expertise, per the Kaggle progression system.
      - n_deleted: Total number of deleted replies (including self).
      - replies: A dict following this same format.

    The csv file, in turn, serves as a summary of the json file, with comment information limited to the hottest and most-voted comments.

    Note: Only the 'content' field is guaranteed to be present for each discussion. The availability of the other fields depends on the stability of the scraping tasks, which may also affect the update frequency.
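To illustrate the nested comments/replies structure, here is a minimal sketch that recursively counts comments. The sample dictionary mirrors the field layout above; the IDs and values are made up.

```python
def count_comments(comments):
    """Recursively count comments, including nested replies."""
    total = 0
    for comment in comments.values():
        total += 1
        total += count_comments(comment.get("replies", {}))
    return total

# Minimal sample mirroring the schema (IDs and values are made up).
sample = {
    "101": {
        "title": "How do I host a competition?",
        "content": "...",
        "comments": {
            "7": {"content": "Try the docs.", "is_appreciation": False,
                  "replies": {"8": {"content": "Thanks!",
                                    "is_appreciation": True,
                                    "replies": {}}}},
        },
    },
}

for disc_id, disc in sample.items():
    print(disc_id, count_comments(disc.get("comments", {})))  # → 101 2
```

The same traversal works on the real json files after a `json.load`, since every `replies` dict follows the same format as `comments`.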

  16. LMS Tracking Dataset

    • kaggle.com
    zip
    Updated May 6, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Prasad Patil (2024). LMS Tracking Dataset [Dataset]. https://www.kaggle.com/datasets/prasad22/lms-tracking-dataset
    Explore at:
    zip(5419 bytes)Available download formats
    Dataset updated
    May 6, 2024
    Authors
    Prasad Patil
    Description

    This dataset was collected by an edtech startup that teaches entrepreneurial life skills to kids aged 6-14 through an animated, gamified video series. Through its learning management system, the company tracks the progress of all subscribers on the platform. The company records platform content-usage activity and follows up with parents if their child becomes inactive on the platform. Here's more information about the dataset.

    Dataset Information:

    • Child Name: Name of the subscriber kid
    • Email Address: Email address created by parent
    • Contact: Contact details of the parent
    • follow up: Responses received by the company employee after a progress follow-up over the phone.
    • response: Segregation of the follow-up responses into categories.
    • Introduction: Tutorial 1
    • Activity:- Know your personality, a fun way: Tutorial 2
    • A Simple Quiz on the previous Video: Quiz on Tutorial 2
    • Lets see what 'Product' is…: Tutorial 3
    • A Simple Quiz on the previous Video: Quiz on Tutorial 3
    • Product that represents me: Tutorial 4
    • Let's see what 'Service' means: Tutorial 5
    • A Simple Quiz on the previous Video: Quiz on Tutorial 5
    • Instruction for 'Product & Service' worksheet: Tutorial 6
    • Activity:- Product and Service Worksheet: Exercise on Tutorial 6
    • Instructions for Product Word Association: Tutorial 7
    • Activity:- Product Word Association: Exercise on Tutorial 7
    • Life without products??.... Impossible !: Tutorial 8
    • What Is a Need?: Tutorial 9
    • A Simple Quiz on the previous Video: Quiz on Tutorial 9
    • Summary of Session 1: Summarizing all the learnings from Tutorials 1-9
    • Your Feedback on Session 1: Feedback page

    There is some missing data as well. I hope it will be a good dataset for beginners practicing their NLP skills.
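As a starting point for the missing data mentioned above, here is a small sketch that counts empty values per column. It uses only the standard library; the sample rows are made up, and the real file would be read with `csv.DictReader`.

```python
from collections import Counter

def missing_counts(rows, columns):
    """Count empty or absent values per column across the dataset rows."""
    missing = Counter()
    for row in rows:
        for col in columns:
            if not row.get(col, "").strip():
                missing[col] += 1
    return missing

# Illustrative rows; real data would come from csv.DictReader over the file.
rows = [
    {"Child Name": "A", "follow up": "Will resume next week", "response": ""},
    {"Child Name": "B", "follow up": "", "response": ""},
]
print(missing_counts(rows, ["Child Name", "follow up", "response"]))
```

Columns with heavy missingness (such as the follow-up responses) are natural candidates for imputation or exclusion before any NLP work.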

    Image by Steven Weirather from Pixabay

    Note: This dataset is partially synthetic, meaning the names, email addresses, and contact details are not those of actual customers. Kindly use it for educational and research purposes.

  17. Arcade Natural Language to Code Challenge

    • kaggle.com
    zip
    Updated Feb 22, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google AI (2023). Arcade Natural Language to Code Challenge [Dataset]. https://www.kaggle.com/datasets/googleai/arcade-nl2code-dataset
    Explore at:
    zip(3921922 bytes)Available download formats
    Dataset updated
    Feb 22, 2023
    Dataset authored and provided by
    Google AI
    Description

    Arcade: Natural Language to Code Generation in Interactive Computing Notebooks

    Arcade is a collection of natural language to code problems grounded in interactive data science notebooks. Each problem features an NL intent as the problem specification, a reference code solution, and preceding notebook context (Markdown or code cells). Arcade can be used to evaluate the accuracy of large language models for code at generating data science programs from natural language instructions. Please read our paper for more details.

    Note: This Kaggle dataset contains only the dataset files of Arcade. Refer to our main GitHub repository for detailed instructions on using this dataset.

    Folder Structure

    Below is the structure of its content:

    └── ./
      ├── existing_tasks/ # Problems derived from existing data science notebooks on GitHub
      │  ├── metadata.json # Metadata used by `build_existing_tasks_split.py` to reproduce this split.
      │  ├── artifacts/ # Folder that stores dependent ML datasets needed to execute the problems, created by running `build_existing_tasks_split.py`
      │  └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
      ├── new_tasks/
      │  ├── dataset.json # Original, unpreprocessed dataset
      │  ├── kaggle_dataset_provenance.csv # Metadata of the Kaggle datasets used to build this split.
      │  ├── artifacts/ # Folder that stores dependent ML Kaggle datasets needed to execute the problems, created by running `build_new_tasks_split.py`
      │  └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
      └── checksums.txt # Table of MD5 checksums of dataset files.
    

    Dataset File Structure

    All the dataset '*.json' files follow the same structure. Each dataset file is a JSON-serialized list of episodes, where each episode corresponds to a notebook annotated with NL-to-code problems. The structure of an episode is documented below:

    {
      "notebook_name": "Name of the notebook.",
      "work_dir": "Path to the dependent data artifacts (e.g., ML datasets) to execute the notebook.",
      "annotator": "Anonymized annotator Id.",
      "turns": [
        # A list of natural language to code examples using the current notebook context.
        {
          "input": "Prompt to a code generation model.",
          "turn": {
            "intent": {
              "value": "Annotated NL intent for the current turn.",
              "is_cell_intent": "Metadata used for the existing tasks split to indicate if the code solution is only part of an existing code cell.",
              "cell_idx": "Index of the intent Markdown cell.",
              "line_span": "Line span of the intent.",
              "not_sure": "Annotation confidence.",
              "output_variables": "List of variable names denoting the output. If None, use the output of the last line of code as the output of the problem.",
            },
            "code": {
              "value": "Reference code solution.",
              "cell_idx": "Cell index of the code cell containing the solution.",
              "num_lines": "Number of lines in the reference solution.",
              "line_span": "Line span.",
            },
            "code_context": "Context code (all code cells before this problem) that need to be executed before executing the reference/predicted programs.",
            "delta_code_context": "Delta context code between the last problem in this notebook and the current problem, useful for incremental execution.",
            "metadata": {
              "annotator_id": "Annotator Id",
              "num_code_lines": "Metadata, please ignore.",
              "utterance_without_output_spec": "Annotated NL intent without output specification. Refer to the paper for details.",
            },
          },
          "notebook": "Field intended to store the Json-serialized Jupyter notebook. Not used for now since the notebook can be reconstructed from other metadata in this file.",
          "metadata": {
            # A dict of metadata of this turn.
            "context_cells": [ # A list of context cells before the problem.
              {
                "cell_type": "code|markdown",
                "source": "Cell content."
              },
            ],
            "delta_cell_num": "Number of preceding context cells between the prior turn and the current turn.",
            # The following fields only occur in datasets inlined with schema descriptions.
            "context_cell_num": "Number of context cells in the prompt after inserting schema descriptions and left-truncation.",
            "inten...
    
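A minimal sketch of reading one of the dataset files, assuming the episode structure documented above. The helper name and sample values are illustrative, and the demo writes a tiny stand-in file rather than using the real `dataset.json`.

```python
import json
import os
import tempfile

def iter_problems(path):
    """Yield (notebook_name, intent, reference_code) for each annotated turn."""
    with open(path) as f:
        episodes = json.load(f)  # top level: a JSON list of episodes
    for episode in episodes:
        for turn in episode["turns"]:
            t = turn["turn"]
            yield (episode["notebook_name"],
                   t["intent"]["value"],
                   t["code"]["value"])

# Demo with a minimal file mirroring the schema (values are made up).
sample = [{"notebook_name": "nb1.ipynb",
           "turns": [{"turn": {"intent": {"value": "Count the rows."},
                               "code": {"value": "len(df)"}}}]}]
path = os.path.join(tempfile.mkdtemp(), "dataset.json")
with open(path, "w") as f:
    json.dump(sample, f)

problems = list(iter_problems(path))
```

Fields like `code_context` and `output_variables` can be pulled out the same way when setting up execution-based evaluation.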
  18. SQL Injection dataset

    • kaggle.com
    zip
    Updated Jun 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ayah Khaldi (2024). SQL Injection dataset [Dataset]. https://www.kaggle.com/datasets/ayahkhaldi/sql-injection-dataset
    Explore at:
    zip(26223784 bytes)Available download formats
    Dataset updated
    Jun 18, 2024
    Authors
    Ayah Khaldi
    Description

    SQL Injection Dataset:

    This SQL injection dataset, sourced from multiple Kaggle datasets, has been cleaned and split into training, validation, and testing subsets in a 6:2:2 ratio. It is intended for use in research focused on detecting SQL injection attacks.

    Kaggle datasets:
    The dataset contains two columns:
    1. Query: This column represents the SQL query.
    2. Label: This column represents the label for the SQL injection binary classification, where 1 indicates an SQL injection and 0 indicates a non-SQL injection.
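A deterministic 6:2:2 split can be sketched as follows. The exact procedure used to produce the published splits is not specified, so treat this as an approximation; the sample (Query, Label) rows are made up.

```python
import random

def split_622(rows, seed=42):
    """Shuffle and split rows into train/val/test with a 6:2:2 ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = int(len(rows) * 0.6)
    n_val = int(len(rows) * 0.2)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

# 100 illustrative (Query, Label) rows: 0 = benign, 1 = injection.
queries = [("SELECT * FROM users WHERE id = 1", 0),
           ("1' OR '1'='1", 1)] * 50
train, val, test = split_622(queries)
print(len(train), len(val), len(test))  # → 60 20 20
```

Shuffling before slicing avoids ordering artifacts from the source datasets leaking into a particular subset.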


  19. University FAQ Dataset

    • kaggle.com
    zip
    Updated Jul 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andre Avindra (2023). University FAQ Dataset [Dataset]. https://www.kaggle.com/datasets/andreavindra/university-faq-dataset
    Explore at:
    zip(8402 bytes)Available download formats
    Dataset updated
    Jul 4, 2023
    Authors
    Andre Avindra
    Description

    This dataset was created for building a chatbot with deep learning and NLP. It can be used to train deep learning models, such as neural networks, with NLP techniques to develop a chatbot that understands user conversation patterns and provides appropriate responses.

  20. WikiHow Featured Articles

    • kaggle.com
    zip
    Updated Jun 28, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Elfarouk Etawil (2023). WikiHow Featured Articles [Dataset]. https://www.kaggle.com/datasets/elfarouketawil/wikihow-featured-articles
    Explore at:
    zip(2809434 bytes)Available download formats
    Dataset updated
    Jun 28, 2023
    Authors
    Elfarouk Etawil
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    This dataset contains a collection of 997 featured articles from wikiHow, a collaborative platform that provides how-to guides on a wide range of topics. Each record in the dataset represents one article and includes the following six columns:

    • Title: the title of the article.
    • Intro: a brief introduction to the article's topic.
    • Article Content: the main content of the article, which provides step-by-step instructions on how to do something.
    • Co-authors: the number of users who have contributed to the article on wikiHow.
    • Updated: the date when the article was last updated on wikiHow.
    • Views: the number of views that the article has received on wikiHow.

    The data was scraped from Wikihow under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license, which allows for non-commercial use of the data as long as proper attribution is given to the original source.

    This dataset can be used for various purposes such as text analysis, natural language processing, and machine learning. Researchers and data analysts can use this dataset to study the characteristics of featured articles on Wikihow, identify patterns or trends in the content or co-authorship, and explore how views and updates correlate with article popularity.
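As a small example of the text analysis this dataset supports, here is a sketch that tallies the most common words across article contents. The helper name, stopword list, and sample texts are illustrative.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "to", "and", "of"})

def top_words(articles, k=3, stopwords=STOPWORDS):
    """Return the k most common words across the given article texts."""
    counts = Counter()
    for text in articles:
        counts.update(w for w in re.findall(r"[a-z']+", text.lower())
                      if w not in stopwords)
    return counts.most_common(k)

# Made-up snippets standing in for the 'Article Content' column.
articles = ["Mix the flour and water.",
            "Knead the dough, then rest the dough."]
print(top_words(articles))  # 'dough' (2 occurrences) ranks first
```

In practice the same tally over the full 'Article Content' column gives a quick view of vocabulary patterns in featured how-to writing.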

Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code

Meta Kaggle Code

Kaggle's public data on notebook code

Explore at:
zip(167219625372 bytes)Available download formats
Dataset updated
Nov 27, 2025
Dataset authored and provided by
Kaggle http://kaggle.com/
License

Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

Explore our public notebook content!

Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

Why we’re releasing this dataset

By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

Meta Kaggle Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

Sensitive data

While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

Joining with Meta Kaggle

The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
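A minimal sketch of preparing that join, assuming the `Id` and `TotalVotes` columns of Meta Kaggle's `KernelVersions.csv`; verify the column names against the file you download, since schemas can change. The demo writes a tiny two-row stand-in rather than the real file.

```python
import csv
import os
import tempfile

def votes_by_version(kernel_versions_csv):
    """Map each KernelVersions Id to its TotalVotes (columns assumed from Meta Kaggle)."""
    with open(kernel_versions_csv, newline="") as f:
        return {row["Id"]: int(row["TotalVotes"] or 0)
                for row in csv.DictReader(f)}

# Demo with a two-row stand-in for KernelVersions.csv.
path = os.path.join(tempfile.mkdtemp(), "KernelVersions.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Id", "TotalVotes"])
    writer.writeheader()
    writer.writerows([{"Id": "123456789", "TotalVotes": "7"},
                      {"Id": "123456790", "TotalVotes": ""}])

votes = votes_by_version(path)
```

Since each code file's name matches an id in `KernelVersions`, stripping the file extension from a Meta Kaggle Code path gives the key to look up in this mapping.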

File organization

The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
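The layout above implies a simple arithmetic mapping from a KernelVersions id to its directory. A sketch follows; whether the folder names carry zero-padding is an assumption worth verifying against the actual files.

```python
def code_file_dir(kernel_version_id):
    """Directory for a KernelVersions id under the two-level layout above.

    Folder names are written unpadded here; check the real dataset for
    the exact naming convention.
    """
    top = kernel_version_id // 1_000_000        # up to 1M files per top folder
    sub = (kernel_version_id // 1_000) % 1_000  # up to 1K files per sub folder
    return f"{top}/{sub}"

print(code_file_dir(123_456_789))  # → 123/456
```

This matches the example in the text: version 123,456,789 lands in folder 123/456.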

The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

Questions / Comments

We love feedback! Let us know in the Discussion tab.

Happy Kaggling!
