This dataset was created by Regi
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Enriching the Meta-Kaggle dataset using Meta Kaggle Code to extract all imports (for both R and Python) and method calls (Python only) as lists, which are then added to the KernelVersions.csv file as the columns Imports and MethodCalls.
[Charts: Most Imported R Packages and Most Imported Python Packages; the visualizations are shown on the dataset page]
We perform this extraction using the following three regex patterns:
PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
PYTHON_METHOD_REGEX = (pattern omitted; the author notes that embedding it breaks Kaggle's renderer)
R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')
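As a minimal sketch of how these patterns might be applied (the helper function and sample snippet below are illustrative, not the author's actual pipeline):

```python
import re

# Pattern quoted verbatim from the description above.
PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')

def extract_python_imports(source: str) -> list:
    """Return the top-level package name for every import found in `source`."""
    imports = []
    for from_mod, import_mod in PYTHON_IMPORT_REGEX.findall(source):
        module = from_mod or import_mod  # exactly one group matches per hit
        imports.append(module.split('.')[0])
    return imports

snippet = "import numpy as np\nfrom sklearn.linear_model import LogisticRegression"
print(extract_python_imports(snippet))  # ['numpy', 'sklearn']
```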
This dataset was created on 06-06-2025. Since the computation required for this process is very resource-intensive and cannot be run on a Kaggle kernel, it is not scheduled. A notebook demonstrating how to create this dataset and what insights it provides can be found here.
This dataset was created by Alexander Ryzhkov
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
This dataset contains the Python files with the code snippets required for the Kaggle kernel: https://www.kaggle.com/code/adeepak7/tensorflow-s-global-and-operation-level-seeds/
Because the kernel demonstrates setting and re-setting global- and operation-level seeds, the effect of a seed set in one cell could not be nullified in the subsequent cells. The snippets are therefore provided as separate Python files, each executed independently in its own cell.
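For context, a minimal sketch of the two seed levels the kernel explores (standard TensorFlow 2 API; the specific seed values are illustrative):

```python
import tensorflow as tf

# Global-level seed: influences every op created after this call.
tf.random.set_seed(42)

# Operation-level seed: pins this op's random stream independently.
a = tf.random.uniform([1], seed=10)

# No operation-level seed here, so this op depends on the global seed alone.
b = tf.random.uniform([1])

print(a.numpy(), b.numpy())
```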
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset is designed to support research in anomaly detection for OS kernels, particularly in the context of power monitoring systems used in embedded environments. It simulates the interaction between system-level operations and power consumption behaviors, providing a rich set of features for training and evaluating hybrid models.
The dataset contains 1,000 records of synthetic yet realistic system behavior, including:
- System call sequences
- Power usage logs (in watts)
- CPU and memory utilization
- Process identifiers and names
- Timestamps
- Labeled entries (Normal or Anomaly)
Anomalies are injected using fuzzy testing principles to simulate abnormal power spikes, syscall irregularities, or excessive resource usage, mimicking real-world kernel faults or malicious activity. This dataset enables the development of robust models that can learn complex, uncertain system behavior patterns for enhanced security and stability of embedded power monitoring applications.
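To illustrate the injection idea, here is a hypothetical sketch; the field names mirror the dataset columns, but the generator itself is not the authors' code:

```python
import random

def inject_anomaly(record: dict) -> dict:
    """Fuzz-style injection: randomly perturb one aspect of a normal record."""
    fault = random.choice(["power_spike", "syscall_burst", "resource_hog"])
    if fault == "power_spike":
        record["power_watts"] *= random.uniform(3, 10)             # abnormal power spike
    elif fault == "syscall_burst":
        record["syscalls"] = record["syscalls"] + ["ioctl"] * 100  # syscall irregularity
    else:
        record["cpu_percent"] = random.uniform(95.0, 100.0)        # excessive resource usage
    record["label"] = "Anomaly"
    return record

normal = {"power_watts": 2.5, "syscalls": ["read", "write"], "cpu_percent": 12.0, "label": "Normal"}
print(inject_anomaly(dict(normal)))
```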
The reason I did this was that I wanted to know whether there is a correlation between Kaggle's top Kernels and Datasets and their popularity (I wanted to know how to get to the top, lol). I scraped the data using DataMiner.
top-kernels has:
top-datasets has:
This dataset was created by Salil Gautam
This dataset was created by Maksim Filin
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
The dataset comprises wheat kernels belonging to three different varieties of wheat: Kama, Rosa, and Canadian, with 70 samples of each. It can be used for classification and cluster analysis tasks.
To construct the data, seven geometric parameters of each wheat kernel were measured; all of these parameters are real-valued and continuous.
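Since the description highlights cluster analysis, here is a brief sketch of how the three varieties might be recovered with k-means; the file name and the variety column are assumptions about the layout:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical file/column names; the data has seven real-valued features
# and a variety label (Kama, Rosa, Canadian), 70 samples each.
seeds = pd.read_csv("seeds_dataset.csv")
X = StandardScaler().fit_transform(seeds.drop(columns=["variety"]))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
seeds["cluster"] = kmeans.labels_

# Cross-tabulate recovered clusters against the true varieties.
print(pd.crosstab(seeds["variety"], seeds["cluster"]))
```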
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was created by Gautier
Released under CC0: Public Domain
This dataset was created by joeland209
haha
This dataset was created by deeplearner
This dataset was created by Justin Chae
This dataset was created by ZykoTsai
This dataset was created by Lavanya Shukla
[Image: kernel_dataclass.png, the Kernel class diagram whose fields are referenced below]
In total, 1,822 kernels used the Kaggle Survey datasets in the years 2017-2022. We organized our data into several distinct datasets, each of which was useful in answering our questions on at least one of the topics. The resulting datasets are briefly described below.
notebooks.zip
Contains the 1,822 raw notebooks, saved as either ipynb or Rmd. 58 notebooks could not be executed in either Python or R, so they were given the extension unknown_format.txt. The name of each file is the notebook_id as listed on kaggle.com and matches notebook_id in the file all_kernels.csv, described below. Among other things, this dataset was used to obtain a per-notebook list of imported libraries, as well as the questions addressed by each notebook.
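A sketch of how the archived files might be paired with their metadata rows; the notebook_id matching follows the description above, while the local paths are assumptions:

```python
import zipfile
from pathlib import Path

import pandas as pd

kernels = pd.read_csv("all_kernels.csv")

# Files are named <notebook_id>.<ext>, where ext is ipynb, Rmd,
# or unknown_format.txt for the 58 non-executable notebooks.
with zipfile.ZipFile("notebooks.zip") as zf:
    by_id = {Path(name).name.split(".")[0]: name for name in zf.namelist()}

kernels["notebook_file"] = kernels["notebook_id"].astype(str).map(by_id)
print(kernels[["notebook_id", "notebook_file"]].head())
```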
all_kernels.csv
Each row of this dataset contains data about one of the 1,822 kernels. The columns correspond to the fields shown in the Kernel class image above. A more detailed overview of the columns can be found on the dataset's Kaggle page.
cleaned_kernels.csv
This is, in effect, the main dataset we used in our competition notebook. We took all_kernels.csv and removed the 233 rows describing kernels that were merely unchanged forks of other kernels.
all_questions.json
Contains all Kaggle Survey questions from the years 2017-2022. In the 2017 survey, the questions were unnumbered, so we numbered them ourselves, keeping the original order and using zero-based indexing. The 2018-2022 surveys have numbered questions, so their numbering was taken unchanged.
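A hedged sketch of reading the questions file; the JSON layout assumed here (a year-to-question-list mapping) is an illustration, not a documented schema:

```python
import json

with open("all_questions.json") as f:
    questions = json.load(f)

# 2017 questions were numbered by the dataset authors, zero-based and in
# the original order; 2018-2022 keep Kaggle's own numbering.
for number, text in enumerate(questions["2017"]):
    print(number, text)
```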
question_map.csv
Looking at the survey questions over several years, one can note that certain questions repeat. For example, every year's survey contains a question like "What is your age?". All such repetitions are captured in this dataset: for each unique question, the question number and the survey year in which it appears are given. The question numbers follow the scheme described above for all_questions.json. Certain questions are worded differently but functionally identical; where such questions were joined, a note was added to alert users of this dataset.
This dataset was created by Abraham Anderson
Open Machine Learning Course mlcourse.ai is designed to perfectly balance theory and practice; therefore, each topic is followed by an assignment with a deadline in a week. You can also take part in several Kaggle Inclass competitions held during the course and write your own tutorials. The next session launches in September 2019. For more info, go to the mlcourse.ai main page.
Outline
This is the list of published articles on medium.com (English), habr.com (Russian), and jqr.com (Chinese). See the Kernels of this Dataset for the same material in English.
1. Exploratory Data Analysis with Pandas: uk, ru, cn, Kaggle Kernel
2. Visual Data Analysis with Python: uk, ru, cn, Kaggle Kernels: part1, part2
3. Classification, Decision Trees and k Nearest Neighbors: uk, ru, cn, Kaggle Kernel
4. Linear Classification and Regression: uk, ru, cn, Kaggle Kernels: part1, part2, part3, part4, part5
5. Bagging and Random Forest: uk, ru, cn, Kaggle Kernels: part1, part2, part3
6. Feature Engineering and Feature Selection: uk, ru, cn, Kaggle Kernel
7. Unsupervised Learning: Principal Component Analysis and Clustering: uk, ru, cn, Kaggle Kernel
8. Vowpal Wabbit: Learning with Gigabytes of Data: uk, ru, cn, Kaggle Kernel
9. Time Series Analysis with Python, part 1: uk, ru, cn. Predicting the future with Facebook Prophet, part 2: uk, cn. Kaggle Kernels: part1, part2
10. Gradient Boosting: uk, ru, cn, Kaggle Kernel
Assignments
Each topic is followed by an assignment. See the demo versions in this Dataset. Solutions will be discussed in the upcoming run of the course.
Kaggle competitions
1. Catch Me If You Can: Intruder Detection through Webpage Session Tracking. Kaggle Inclass
2. How good is your Medium article? Kaggle Inclass
Rating
Throughout the course we maintain a student rating. It takes into account credits scored in assignments and Kaggle competitions. Top students (according to the final rating) will be listed on a special Wiki page.
Community
Discussions between students are held in the #mlcourse_ai channel of the OpenDataScience Slack team. A registration form will be shared prior to the start of the new session.
Collaboration
You can publish Kernels using this Dataset, but please respect others' interests: don't share solutions to assignments or well-performing solutions to Kaggle Inclass competitions. If you notice any typos or errors in the course material, please open an Issue or make a pull request in the course repo. The course is free, but you can support the organizers by making a pledge on Patreon (monthly support) or a one-time payment on Ko-fi.
CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was created by Darien Schettler
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
| Column | Description |
|---|---|
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
| Column | Description |
|---|---|
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
| Column | Description |
|---|---|
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
| Column | Description |
|---|---|
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
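A minimal sketch of those joins in pandas, using the file and column names given in the tables above:

```python
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels = pd.read_csv("kernels_meta.csv")
competitions = pd.read_csv("competitions_meta.csv")

# code block -> notebook metadata (kernel_id) -> competition metadata (comp_name)
merged = (code_blocks
          .merge(kernels, on="kernel_id")
          .merge(competitions, on="comp_name"))
print(merged.head())
```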
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as: