MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset enriches the Meta Kaggle dataset using Meta Kaggle Code to extract all imports (for both R and Python) and method calls (Python only) as lists, which are then added to the KernelVersions.csv file as the columns Imports and MethodCalls.
| Most Imported R Packages | Most Imported Python Packages |
|---|---|
We perform this extraction using the following three regex patterns:

```python
import re

PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
# PYTHON_METHOD_REGEX is omitted by the author ("kaggle kinda breaks if I do lol").
R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')
```
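As a sketch of how the two surviving patterns can be applied, the helper functions and sample snippets below are illustrative only and are not part of the dataset's actual pipeline:

```python
import re

PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')

def python_imports(source: str) -> list[str]:
    # Each match has two groups (from-import form vs. plain import form);
    # exactly one of them is non-empty, so keep whichever matched.
    return [a or b for a, b in PYTHON_IMPORT_REGEX.findall(source)]

def r_imports(source: str) -> list[str]:
    # The R pattern has a single capture group: the package name.
    return R_IMPORT_REGEX.findall(source)

print(python_imports("import numpy as np\nfrom sklearn.linear_model import LinearRegression"))
# ['numpy', 'sklearn.linear_model']
print(r_imports('library(ggplot2)\nrequire("dplyr")'))
# ['ggplot2', 'dplyr']
```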
This dataset was created on 06-06-2025. Since the computation required for this process is very resource-intensive and cannot be run on a Kaggle kernel, it is not scheduled. A notebook demonstrating how to create this dataset and what insights it provides can be found here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
| Column | Description |
|---|---|
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
| Column | Description |
|---|---|
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
| Column | Description |
|---|---|
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
| Column | Description |
|---|---|
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv can be joined to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
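A minimal sketch of that mapping in pandas, using toy in-memory frames with the column names from the tables above (real usage would read the CSV files with pd.read_csv):

```python
import pandas as pd

# Toy frames mirroring the CSV schemas described above.
code_blocks = pd.DataFrame({
    "code_blocks_index": [0, 1],
    "kernel_id": [10, 11],
    "code_block": ["import pandas", "model.fit(X, y)"],
})
kernels = pd.DataFrame({
    "kernel_id": [10, 11],
    "comp_name": ["titanic", "digit-recognizer"],
    "kaggle_score": [0.78, 0.99],
})
competitions = pd.DataFrame({
    "comp_name": ["titanic", "digit-recognizer"],
    "data_type": ["tabular", "image"],
})

full = (code_blocks
        .merge(kernels, on="kernel_id", how="inner")       # code block -> notebook
        .merge(competitions, on="comp_name", how="left"))  # notebook -> competition
print(full[["code_blocks_index", "comp_name", "data_type"]])
```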
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:
This dataset was created by Alexander Ryzhkov
This dataset was created by Regi
The reason I did this was that I wanted to know whether there was a correlation between Kaggle's top kernels and datasets and their popularity (wanted to know how to get to the top, lol). I scraped the data using DataMiner.
top-kernels has:
top-datasets has:
This dataset was created by deeplearner
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed to support research in anomaly detection for OS kernels, particularly in the context of power monitoring systems used in embedded environments. It simulates the interaction between system-level operations and power consumption behaviors, providing a rich set of features for training and evaluating hybrid models.
The dataset contains 1,000 records of simulated yet realistic system behavior, including:
System call sequences
Power usage logs (in watts)
CPU and memory utilization
Process identifiers and names
Timestamps
Labeled entries (Normal or Anomaly)
Anomalies are injected using fuzzy testing principles to simulate abnormal power spikes, syscall irregularities, or excessive resource usage, mimicking real-world kernel faults or malicious activity. This dataset enables the development of robust models that can learn complex, uncertain system behavior patterns for enhanced security and stability of embedded power monitoring applications.
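A rough sketch of how such labeled records might be generated; the field names, value ranges, and the 10% anomaly rate here are assumptions for illustration, not the dataset's actual schema:

```python
import random

def make_record(anomalous: bool) -> dict:
    """Build one synthetic record; anomalies get a fuzzed power spike."""
    power = random.uniform(2.0, 5.0)  # baseline draw in watts (assumed range)
    if anomalous:
        power *= random.uniform(3.0, 6.0)  # fuzz: abnormal power spike
    return {
        "syscalls": random.choices(["read", "write", "open", "ioctl"], k=5),
        "power_watts": round(power, 2),
        "cpu_pct": round(random.uniform(1, 95), 1),
        "label": "Anomaly" if anomalous else "Normal",
    }

# 1,000 records with roughly 10% injected anomalies.
records = [make_record(random.random() < 0.1) for _ in range(1000)]
```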
Collections of kernel submissions for the Kaggle survey competitions from 2017 to 2022. As this data was collected during the 2022 survey competition, it does not contain all the kernels for 2022.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset contains the Python files with the snippets required for the Kaggle kernel https://www.kaggle.com/code/adeepak7/tensorflow-s-global-and-operation-level-seeds/
Since the kernel is about setting and re-setting global and operation-level seeds, the effect of these seeds could not be nullified in subsequent cells. Hence, the snippets are provided as separate Python files, and each file is executed independently in a separate cell.
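The distinction the kernel explores can be sketched with Python's standard random module rather than TensorFlow itself: a global seed fixes the shared stream for all subsequent calls, while an operation-level seed is imitated here by a dedicated generator instance that is reproducible independently of the global stream:

```python
import random

# Global-level seed: fixes the shared stream for all subsequent calls.
random.seed(42)
x = random.random()

# "Operation-level" seed, sketched as a dedicated generator instance:
# reproducible on its own, independent of the global stream.
g = random.Random(7)
y = g.random()

# Re-seeding reproduces both streams exactly.
random.seed(42)
assert random.random() == x          # global stream reproduced
assert random.Random(7).random() == y  # per-operation stream reproduced
```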
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.
Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.
Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here
We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.
Please also note some quirks in the data: the UserId column in the ForumMessages table has values that do not exist in the Users table, and several tables carry denormalized Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
I create the database tables with the db_abd_create_tables.sql script and clean the data with the clean_data.py script. For each table, the script replaces missing values with NULL. Foreign keys are then added with the add_foreign_keys.sql script, and finally I recompute the Total columns in the database tables by running the update_totals.sql script.
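As an illustration of the kind of cleanup such scripts perform, the sketch below nulls out orphaned ForumMessages.UserId values in SQLite so that a foreign key can be added afterwards; the actual contents of clean_data.py and the SQL scripts are not reproduced here, so table contents and the exact strategy are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Users (Id INTEGER PRIMARY KEY);
CREATE TABLE ForumMessages (Id INTEGER PRIMARY KEY, UserId INTEGER);
INSERT INTO Users VALUES (1), (2);
INSERT INTO ForumMessages VALUES (10, 1), (11, 99);  -- 99 has no Users row
""")

# Null out orphaned UserId values so a foreign key constraint can be added.
con.execute("""
UPDATE ForumMessages SET UserId = NULL
WHERE UserId NOT IN (SELECT Id FROM Users)
""")
orphans = con.execute(
    "SELECT COUNT(*) FROM ForumMessages WHERE UserId IS NULL").fetchone()[0]
print(orphans)  # 1
```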
This dataset was created by Ravi Bharathi
Released under Data files © Original Authors
(Image: kernel_dataclass.png, a diagram of the Kernel class and its fields.)
All in all, in the years 2017-2022, 1822 kernels used the Kaggle Survey datasets. We have ordered our data into several distinct datasets, each of which was useful in obtaining answers to our questions on at least one of the topics. The obtained datasets are briefly overviewed below.
notebooks.zip
Contains 1822 raw notebooks saved as either ipynb or Rmd. 58 notebooks could not be executed in either Python or R, so they were given the extension unknown_format.txt. The name of each file is the notebook_id as listed on kaggle.com and matches notebook_id in the file all_kernels.csv, which is described below. Among other things, this dataset was used to obtain a per-notebook list of imported libraries, as well as the questions addressed by each notebook.
all_kernels.csv
Each row of this dataset contains data about one of the 1822 kernels. The columns correspond to all the fields listed in the Kernel class image above. A more detailed overview of the columns can be found on the dataset's Kaggle page.
cleaned_kernels.csv
This is, in effect, the main dataset we used in our competition notebook. We took all_kernels.csv and removed from it 233 rows describing kernels that were just unchanged forks of other kernels.
all_questions.json
Contains all Kaggle Survey questions from the years 2017-2022. In the year 2017, the survey questions were unnumbered, so we numbered them ourselves, keeping the original order and using zero-based indexing. Surveys 2018-2022 have numbered questions, so the index was taken unchanged.
question_map.csv
Looking at survey questions over several years, one can note that certain questions repeat. For example, every year's survey contains a question "What is your age". All such repetitions are captured in this dataset. For each unique question, the question number and the survey year in which it appears are given. The question numbers are described above under all_questions.json. Certain questions are worded differently but are functionally identical; if such questions were joined, a note was added to alert other users of this dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle.
datasetUrl: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl: The URL of the dataset owner's profile avatar on Kaggle.
ownerName: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl: A link to the Kaggle profile page of the dataset owner.
ownerUserId: The unique user ID of the dataset owner on Kaggle.
ownerTier: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName: The name of the dataset creator, which could be different from the owner.
creatorUrl: A link to the Kaggle profile page of the dataset creator.
creatorUserId: The unique user ID of the dataset creator.
scriptCount: The number of scripts (kernels) associated with this dataset.
scriptsUrl: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl: The URL to the discussion forum for this dataset, where users can ask questions and share insights.
viewCount: The number of views the dataset page has received on Kaggle.
downloadCount: The number of times the dataset has been downloaded by users.
dateCreated: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated: The date when the dataset was last updated or modified.
voteButton: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl: A direct link to download the dataset files.
newKernelNotebookUrl: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl: A link to a new script for running computations or processing data related to the dataset.
usabilityRating: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath: A reference to the path in Firestore where this dataset's metadata is stored.
datasetSlug: A URL-friendly version of the dataset name, typically used in URLs.
rank: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names: A comma-separated string of category names that represent the dataset's classification.
This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way.
This isn't a dataset; it is a collection of kernels written on Kaggle that use no data at all.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Kaggle kernels don't have the pystacknet package, so we created a dataset containing it for the Petfinder competition.
Code from: https://github.com/h2oai/pystacknet
@bkkaggle (https://www.kaggle.com/bkkaggle) helped with creating the dataset
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
The dataset comprises wheat kernels belonging to three different varieties of wheat: Kama, Rosa, and Canadian, with 70 elements each. It can be used for classification and cluster analysis tasks.
To construct the data, seven geometric parameters of wheat kernels were measured; all of these parameters are real-valued and continuous.
This dataset was created by Maksim Filin
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created using the reference below:
https://archive.ics.uci.edu/dataset/1/abalone
We import the corresponding repository in a Kaggle kernel and populate the dataset from it. Users may load the dataset with a simple read_csv in pandas and proceed with their solution.
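For instance, loading the UCI abalone data with pandas might look like the sketch below; the inline sample stands in for the real file (on Kaggle the path would point into /kaggle/input/), and the column names are supplied because the UCI file ships without a header row:

```python
import io
import pandas as pd

# Column names per the UCI abalone description; the raw file has no header.
cols = ["Sex", "Length", "Diameter", "Height", "WholeWeight",
        "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]

# Two sample rows from the UCI file, standing in for the mounted dataset path.
sample = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n")

df = pd.read_csv(sample, header=None, names=cols)
print(df.shape)  # (2, 9)
```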
Best wishes!
This dataset was created by Justin Chae
This dataset was created by KlemenVodopivec