Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle's community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code's author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub-folder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
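As a minimal sketch of that id-to-path mapping (a hypothetical helper; whether folder names are zero-padded and which file extension applies are assumptions):

```python
def kernel_version_path(version_id: int, ext: str = "ipynb") -> str:
    # Top-level folder groups ids by millions: 123 holds 123,000,000-123,999,999.
    top = version_id // 1_000_000
    # Sub-folder groups by thousands: 123/456 holds 123,456,000-123,456,999.
    sub = (version_id // 1_000) % 1_000
    return f"{top}/{sub}/{version_id}.{ext}"

print(kernel_version_path(123_456_789))  # -> "123/456/123456789.ipynb"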
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
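For the requester-pays bucket, here is a hedged download sketch using the google-cloud-storage client (the billing project id and object path are placeholders; your GCP project is billed for the transfer):

```python
from google.cloud import storage  # pip install google-cloud-storage

BILLING_PROJECT = "your-gcp-project"  # placeholder: a project with billing enabled

client = storage.Client(project=BILLING_PROJECT)
# user_project tells GCS which project to bill for this requester-pays access.
bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project=BILLING_PROJECT)
blob = bucket.blob("123/456/123456789.ipynb")  # placeholder object path
blob.download_to_filename("123456789.ipynb")
```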
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
KGTorrent is a dataset of Python Jupyter notebooks from the Kaggle platform.
The dataset is accompanied by a MySQL database containing metadata about the notebooks and the activity of Kaggle users on the platform. The information to build the MySQL database has been derived from Meta Kaggle, a publicly available dataset containing Kaggle metadata.
In this package, we share the complete KGTorrent dataset (consisting of the dataset itself plus its companion database), as well as the specific version of Meta Kaggle used to build the database.
More specifically, the package comprises the following three compressed archives:
KGT_dataset.tar.bz2, the dataset of Jupyter notebooks;
KGTorrent_dump_10-2020.sql.tar.bz2, the dump of the MySQL companion database;
MetaKaggle27Oct2020.tar.bz2, a copy of the Meta Kaggle version used to build the database.
Moreover, we include KGTorrent_logical_schema.pdf, the logical schema of the KGTorrent MySQL database.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and artificial intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was the founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]
Source: Kaggle
Gholamreza/test-dataset-kaggle is a dataset hosted on Hugging Face and contributed by the HF Datasets community.
This dataset contains 1,000,000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.
Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is clean, simple, and structured to be beginner-friendly.
Random noise simulates differences in learning ability, motivation, etc.
Regression Tasks: predict total_score from weekly_self_study_hours, attendance_percentage, and class_participation (see the sketch below).
Classification Tasks: predict grade (A-F) using study hours, attendance, and participation.
Model Evaluation Practice.
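As a quick illustration of the regression task, here is a minimal scikit-learn sketch; the file name student_performance.csv is a placeholder, and the column names follow the description above:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder file name; columns follow the task description above.
df = pd.read_csv("student_performance.csv")
X = df[["weekly_self_study_hours", "attendance_percentage", "class_participation"]]
y = df["total_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```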
This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).
Open Database License (ODbL) (https://choosealicense.com/licenses/odbl/)
Date: 2022-07-10
Files: ner_dataset.csv
Source: Kaggle entity annotated corpus
Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.
About Dataset
from Kaggle Datasets
Context
Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features produced by Natural Language Processing applied to the data set. Tip: use a pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
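Following the tip above, a minimal loading sketch (the encoding argument is an assumption; copies of this file often ship as latin-1 rather than UTF-8):

```python
import pandas as pd

# Encoding is an assumption: many copies of ner_dataset.csv are latin-1, not UTF-8.
df = pd.read_csv("ner_dataset.csv", encoding="latin-1")
print(df.head())  # expect token and uppercase NER tag label columns
```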
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle.
datasetUrl: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl: The URL of the dataset owner's profile avatar on Kaggle.
ownerName: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl: A link to the Kaggle profile page of the dataset owner.
ownerUserId: The unique user ID of the dataset owner on Kaggle.
ownerTier: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName: The name of the dataset creator, which could be different from the owner.
creatorUrl: A link to the Kaggle profile page of the dataset creator.
creatorUserId: The unique user ID of the dataset creator.
scriptCount: The number of scripts (kernels) associated with this dataset.
scriptsUrl: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl: The URL of the discussion forum for this dataset, where users can ask questions and share insights.
viewCount: The number of views the dataset page has received on Kaggle.
downloadCount: The number of times the dataset has been downloaded by users.
dateCreated: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated: The date when the dataset was last updated or modified.
voteButton: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl: A direct link to download the dataset files.
newKernelNotebookUrl: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl: A link to a new script for running computations or processing data related to the dataset.
usabilityRating: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath: A reference to the path in Firestore where this dataset's metadata is stored.
datasetSlug: A URL-friendly version of the dataset name, typically used for URLs.
rank: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names: A comma-separated string of category names that represent the dataset's classification.
This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way.
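As a starting point for that kind of analysis, here is a minimal pandas sketch (the csv file name kaggle_datasets_metadata.csv is a placeholder; the columns follow the data dictionary above):

```python
import pandas as pd

# Placeholder file name; columns follow the data dictionary above.
meta = pd.read_csv("kaggle_datasets_metadata.csv")

# Top 3 most-downloaded datasets per license.
top = (meta.sort_values("downloadCount", ascending=False)
           .groupby("licenseName")
           .head(3))
print(top[["ownerName", "licenseName", "downloadCount", "totalVotes"]])
```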
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
## Overview
Kaggle Annotation is a dataset for object detection tasks; it contains Objects annotations for 965 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset can be used to train an Open Book model for Kaggle's LLM Science Exam competition. It was generated by searching and concatenating all publicly shared datasets on Sept 1, 2023.
The context column was generated using Mgoksu's notebook here, with NUM_TITLES=5 and NUM_SENTENCES=20.
The source column indicates where the dataset originated. Below are the sources:
source = 1 & 2: Radek's 6.5k dataset. Discussion here and here, dataset here.
source = 3 & 4: Radek's 15k + 5.9k. Discussion here and here, dataset here.
source = 5 & 6: Radek's 6k + 6k. Discussion here and here, dataset here.
source = 7: Leonid's 1k. Discussion here, dataset here.
source = 8: Gigkpeaeums 3k. Discussion here, dataset here.
source = 9: Anil 3.4k. Discussion here, dataset here.
source = 10, 11, 12: Mgoksu 13k. Discussion here, dataset here.
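A small sketch for slicing rows by provenance; the file name is a placeholder, and only the source column is taken from the description above:

```python
import pandas as pd

df = pd.read_csv("llm_science_exam_openbook.csv")  # placeholder file name
radek_rows = df[df["source"].isin([1, 2])]         # rows from Radek's 6.5k dataset
print(len(radek_rows), "rows from sources 1 & 2")
```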
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset offers insights into how daily technology usage, including social media and screen time, impacts mental health. It captures various behavioral patterns and their correlations with mental health indicators like stress levels, sleep quality, and productivity. Dive in to analyze the relationship between our digital lives and mental wellness!
The data is useful for research, academic projects, or building predictive models to understand trends in mental health influenced by screen time and technology habits.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
The Learning Resources Database is a catalog of interactive tutorials, videos, online classes, finding aids, and other instructional resources on National Library of Medicine (NLM) products and services. Resources may be available for immediate use via a browser or downloadable for use in course management systems.
Dataset Description
It contains 520 rows and 13 variables, as listed below:
- Resource ID: Alphanumeric identifier
- Resource Name: Title of the resource
- Resource URL: Link to the resource
- Description: Brief explanation of the resource
- Archived: Flagged as False for all data points
- Format: Format of the resource, e.g., HTML, PDF, MP4 video, MS Word, PowerPoint
- Type: Type of the resource, e.g., webinar, document, tutorial, slides
- Runtime: Runtime of the resource
- Subject Areas: Topics covered in the resource
- Authoring Organization: Name of the authoring organization
- Intended Audiences: Profile of the intended audience
- Record Modified: Timestamp of the record's last modification
- Resource Revised: Timestamp of the resource's last modification
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Context
Kaggle is one of the largest communities of data scientists and machine learning practitioners in the world, and its platform hosts thousands of datasets covering a wide range of topics and industries. With so many options to choose from, it can be difficult to know where to start or which datasets are worth exploring. That's where this dataset comes in. By scraping information about the top 10,000 datasets on Kaggle, we have created a single source of truth for the most popular and useful datasets on the platform. This dataset is not just a list of names and numbers; it is a valuable tool for data enthusiasts and professionals alike, providing insights into the latest trends and techniques in data science and machine learning.
Column description:
- Dataset_name: Name of the dataset
- Author_name: Name of the author
- Author_id: Kaggle id of the author
- No_of_files: Number of files the author has uploaded
- size: Size of all the files
- Type_of_file: Type of the files, such as csv, json, etc.
- Upvotes: Total upvotes of the dataset
- Medals: Medal of the dataset
- Usability: Usability score of the dataset
- Date: Date on which the dataset was uploaded
- Day: Day on which the dataset was uploaded
- Time: Time at which the dataset was uploaded
- Dataset_link: Kaggle link of the dataset
Acknowledgements
The data has been scraped from the official Kaggle website and is available under a Creative Commons license.
Enjoy & Keep Learning !!!
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
I've always wanted to explore Kaggle's Meta Kaggle dataset, but I am more comfortable using T-SQL when it comes to writing (very) complex queries. I also tend to write queries faster in SQL Server Management Studio, like 100x faster. So I ported Kaggle's Meta Kaggle dataset into MS SQL Server 2022 database format, created a backup file, and uploaded it here.
Explore Kaggle's public data on competitions, datasets, kernels (code/notebooks), and more. Meta Kaggle may not be the Rosetta Stone of data science, but they think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done in a Jupyter notebook:
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub repository of the project:
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the project:
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Note: This is a work in progress, and not all the Kaggle forums are included in this dataset. The remaining forums will be added once I finish solving some issues with the data generators for those forums.
Welcome to the Kaggle Forum Discussions dataset! This dataset contains curated data about recent discussions opened in the different forums on Kaggle. The data is obtained through web scraping with the Selenium library, and text data is converted to Markdown style using the markdownify package.
This dataset contains information about the discussion's main topic, topic title, comments, votes, medals, and more, and is designed to complement the data available in the Meta Kaggle dataset, specifically for recent discussions. Keep reading for the details.
Because Kaggle is a dynamic website that relies heavily on JavaScript (JS), I extracted the data in this dataset through web scraping techniques using the Selenium library.
The functions and classes used to scrape the data on Kaggle were stored in a utility script publicly available here. Since JS-generated pages like Kaggle are unstable when scraped, the mentioned script implements capabilities for retrying connections and waiting for elements to appear.
Each forum was scraped using one notebook per forum; those notebooks were then connected to a central notebook that generates this dataset. The discussions are also scraped in parallel to improve speed. This dataset represents all the data that can be gathered in a single notebook session, from the most recent to the oldest.
If you need more control over the data you want to research, feel free to import all you need from the utility script mentioned before.
This dataset contains several folders, each named after the discussion forum it contains data about. For example, the 'competition-hosting' folder contains data about the Competition Hosting forum. Inside each folder, you'll find two files: a csv file and a json file.
The json file (in Python, represented as a dictionary) is indexed with the ID that Kaggle assigns to the discussion. Each ID is paired with its corresponding discussion, which is represented as a nested dictionary (the discussion dict) containing the following fields:
- title: The title of the main topic.
- content: Content of the main topic.
- tags: List containing the discussion's tags.
- datetime: Date and time at which the discussion was published (in ISO 8601 format).
- votes: Number of votes received by the discussion.
- medal: Medal awarded to the main topic (if any).
- user: User that published the main topic.
- expertise: Publisher's expertise, measured by the Kaggle progression system.
- n_comments: Total number of comments in the current discussion.
- n_appreciation_comments: Total number of appreciation comments in the current discussion.
- comments: Dictionary containing data about the comments in the discussion. Each comment is indexed by an ID assigned by Kaggle and contains the following fields:
  - content: Comment's content.
  - is_appreciation: Whether the comment is an appreciation comment.
  - is_deleted: Whether the comment was deleted.
  - n_replies: Number of replies to the comment.
  - datetime: Date and time at which the comment was published (in ISO 8601 format).
  - votes: Number of votes received by the current comment.
  - medal: Medal awarded to the comment (if any).
  - user: User that published the comment.
  - expertise: Publisher's expertise, measured by the Kaggle progression system.
  - n_deleted: Total number of deleted replies (including self).
  - replies: A dict following this same format.
The csv file, on the other hand, serves as a summary of the json file, with the comment information limited to the hottest and most voted comments.
Note: Only the 'content' field is mandatory for each discussion. The availability of the other fields is subject to the stability of the scraping tasks, which may also affect the update frequency.
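A minimal sketch for walking one forum's json file (the file name inside the folder is a placeholder):

```python
import json

# Placeholder path; each forum folder contains one json and one csv file.
with open("competition-hosting/discussions.json") as f:
    discussions = json.load(f)

for disc_id, disc in discussions.items():
    # Only 'content' is guaranteed; the other fields may be missing.
    title = disc.get("title", "<untitled>")
    votes = disc.get("votes", 0)
    print(f"{disc_id}: {title!r} ({votes} votes, {disc.get('n_comments', 0)} comments)")
```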
This dataset was collected by an edtech startup. The startup teaches entrepreneurial life-skills in an animated, gamified format through its video series to kids in the 6-14 age group. Through its learning management system, the company tracks the progress made by all of its subscribers on the platform. The company records platform content usage activity data and tries to follow up with parents if their child is inactive on the platform. Here's more information about the dataset:
There is some missing data as well. I hope it will be a good dataset for beginners practicing their NLP skills.
Image by Steven Weirather from Pixabay
Arcade is a collection of natural language to code problems on interactive data science notebooks. Each problem features an NL intent as the problem specification, a reference code solution, and preceding notebook context (Markdown or code cells). Arcade can be used to evaluate the accuracy of code large language models in generating data science programs given natural language instructions. Please read our paper for more details.
Note: This Kaggle dataset only contains the dataset files of Arcade. Refer to our main GitHub repository for detailed instructions to use this dataset.
Below is the structure of its content:
./
├── existing_tasks/                    # Problems derived from existing data science notebooks on Github
│   ├── metadata.json                  # Metadata by `build_existing_tasks_split.py` to reproduce this split.
│   ├── artifacts/                     # Folder that stores dependent ML datasets to execute the problems, created by running `build_existing_tasks_split.py`
│   └── derived_datasets/              # Folder for preprocessed datasets used for prompting experiments.
├── new_tasks/
│   ├── dataset.json                   # Original, unpreprocessed dataset
│   ├── kaggle_dataset_provenance.csv  # Metadata of the Kaggle datasets used to build this split.
│   ├── artifacts/                     # Folder that stores dependent ML Kaggle datasets to execute the problems, created by running `build_new_tasks_split.py`
│   └── derived_datasets/              # Folder for preprocessed datasets used for prompting experiments.
└── checksums.txt                      # Table of MD5 checksums of dataset files.
All the dataset '*.json' files follow the same structure. Each dataset file is a JSON-serialized list of Episodes. Each episode corresponds to a notebook annotated with NL-to-code problems. The structure of an episode is documented below:
{
"notebook_name": "Name of the notebook.",
"work_dir": "Path to the dependent data artifacts (e.g., ML datasets) to execute the notebook.",
"annotator": "Anonymized annotator Id."
"turns": [
# A list of natural language to code examples using the current notebook context.
{
"input": "Prompt to a code generation model.",
"turn": {
"intent": {
"value": "Annotated NL intent for the current turn.",
"is_cell_intent": "Metadata used for the existing tasks split to indicate if the code solution is only part of an existing code cell.",
"cell_idx": "Index of the intent Markdown cell.",
"line_span": "Line span of the intent.",
"not_sure": "Annotation confidence.",
"output_variables": "List of variable names denoting the output. If None, use the output of the last line of code as the output of the problem.",
},
"code": {
"value": "Reference code solution.",
"cell_idx": "Cell index of the code cell containing the solution.",
"num_lines": "Number of lines in the reference solution.",
"line_span": "Line span.",
},
"code_context": "Context code (all code cells before this problem) that need to be executed before executing the reference/predicted programs.",
"delta_code_context": "Delta context code between the last problem in this notebook and the current problem, useful for incremental execution.",
"metadata": {
"annotator_id": "Annotator Id",
"num_code_lines": "Metadata, please ignore.",
"utterance_without_output_spec": "Annotated NL intent without output specification. Refer to the paper for details.",
},
},
"notebook": "Field intended to store the Json-serialized Jupyter notebook. Not used for now since the notebook can be reconstructed from other metadata in this file.",
"metadata": {
# A dict of metadata of this turn.
"context_cells": [ # A list of context cells before the problem.
{
"cell_type": "code|markdown",
"source": "Cell content."
},
],
"delta_cell_num": "Number of preceding context cells between the prior turn and the current turn.",
# The following fields only occur in datasets inlined with schema descriptions.
"context_cell_num": "Number of context cells in the prompt after inserting schema descriptions and left-truncation.",
"inten...
This is a cleaned SQL injection dataset, sourced from multiple Kaggle datasets and split into training, validation, and testing subsets with a 6:2:2 ratio. The dataset is intended for use in research focused on detecting SQL injection attacks.
Label: This column represents the label for the SQL injection binary classification, where 1 indicates an SQL injection and 0 indicates a non-SQL injection.
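To reproduce a 6:2:2 split on similar data, a minimal scikit-learn sketch (the file name is a placeholder; the Label column follows the description above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sql_injection.csv")  # placeholder file name
# First carve off 40%, then halve it: 60% train, 20% validation, 20% test.
train, rest = train_test_split(df, test_size=0.4, stratify=df["Label"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["Label"], random_state=42)
print(len(train), len(val), len(test))  # approximately a 6:2:2 ratio
```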
This dataset was created for building a chatbot using deep learning and NLP. It can be used as input to train deep learning models, such as neural networks, with NLP techniques to develop a chatbot that understands user conversation patterns and provides appropriate responses.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) (https://creativecommons.org/licenses/by-nc-sa/3.0/)
License information was derived automatically
This dataset contains a collection of 997 featured articles from Wikihow, a collaborative platform that provides how-to guides on a wide range of topics. Each record in the dataset represents one article and includes the following six columns:
- Title: the title of the article.
- Intro: a brief introduction to the article's topic.
- Article Content: the main content of the article, which provides step-by-step instructions on how to do something.
- Co-authors: the number of users who have contributed to the article on Wikihow.
- Updated: the date when the article was last updated on Wikihow.
- Views: the number of views that the article has received on Wikihow.
The data was scraped from Wikihow under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license, which allows for non-commercial use of the data as long as proper attribution is given to the original source.
This dataset can be used for various purposes such as text analysis, natural language processing, and machine learning. Researchers and data analysts can use this dataset to study the characteristics of featured articles on Wikihow, identify patterns or trends in the content or co-authorship, and explore how views and updates correlate with article popularity.