Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
This dataset is about Python programming. Questions and answers were generated using Gemma. There are more than four hundred questions with their corresponding answers, ranging from concepts such as data types, variables, and keywords to regular expressions and threading.
I have used this dataset here.
The code used to generate the dataset is available here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
KGTorrent is a dataset of Python Jupyter notebooks from the Kaggle platform.
The dataset is accompanied by a MySQL database containing metadata about the notebooks and the activity of Kaggle users on the platform. The information to build the MySQL database has been derived from Meta Kaggle, a publicly available dataset containing Kaggle metadata.
In this package, we share the complete KGTorrent dataset (consisting of the dataset itself plus its companion database), as well as the specific version of Meta Kaggle used to build the database.
More specifically, the package comprises the following three compressed archives:
KGT_dataset.tar.bz2, the dataset of Jupyter notebooks;
KGTorrent_dump_10-2020.sql.tar.bz2, the dump of the MySQL companion database;
MetaKaggle27Oct2020.tar.bz2, a copy of the Meta Kaggle version used to build the database.
Moreover, we include KGTorrent_logical_schema.pdf, the logical schema of the KGTorrent MySQL database.
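A minimal getting-started sketch: once the dump has been restored into MySQL, the companion database can be inspected from Python. The database name, credentials, and connector below are assumptions of this illustration, not part of the package.

import mysql.connector  # pip install mysql-connector-python

# Assumes the dump has already been restored, e.g. into a local database
# named "kgtorrent" (the name is arbitrary); adjust credentials to your setup.
conn = mysql.connector.connect(host="localhost", user="root", password="***", database="kgtorrent")
cur = conn.cursor()
cur.execute("SHOW TABLES")  # the tables are documented in KGTorrent_logical_schema.pdf
for (table,) in cur.fetchall():
    print(table)
conn.close()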
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle, which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between the Kaggle and Stack Overflow communities, and more.
The best part is that Meta Kaggle enriches Meta Kaggle Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code's author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions CSV file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
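As a minimal sketch of the layout described above (my own illustration; the exact folder naming and the per-file extension are assumptions, not documented here):

def kernel_version_path(version_id: int, ext: str = "py") -> str:
    # Top-level folder holds a block of one million ids, the subfolder a
    # block of one thousand, and the file name is the KernelVersions id.
    top = version_id // 1_000_000        # e.g. 123 for id 123,456,789
    sub = (version_id // 1_000) % 1_000  # e.g. 456 -> folder 123/456
    return f"{top}/{sub}/{version_id}.{ext}"

print(kernel_version_path(123_456_789))  # -> 123/456/123456789.py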
The .ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of .ipynb files with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket, which means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
This dataset was created by Jordan Tantuico
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset enriches the Meta Kaggle dataset using Meta Kaggle Code by extracting all imports (for both R and Python) and method calls (Python only) as lists, which are then added to the KernelVersions.csv file as the columns Imports and MethodCalls.
[Table: Most Imported R Packages | Most Imported Python Packages]
We perform this extraction using the following three regex patterns:
PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
PYTHON_METHOD_REGEX = *I wish I could add the regex here but kaggle kinda breaks if I do lol*
R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')
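As a minimal sketch of how the two patterns above can be applied (my own illustration; the dataset's actual extraction code differs in detail):

import re

PYTHON_IMPORT_REGEX = re.compile(r'(?:from\s+([a-zA-Z0-9_\.]+)\s+import|import\s+([a-zA-Z0-9_\.]+))')
R_IMPORT_REGEX = re.compile(r'(?:library|require)\((?:[\'"]?)([a-zA-Z0-9_.]+)(?:[\'"]?)\)')

def python_imports(source):
    # Each match captures either the "from X import ..." module (group 1)
    # or the "import X" module (group 2); keep whichever is non-empty.
    return [a or b for a, b in PYTHON_IMPORT_REGEX.findall(source)]

def r_imports(source):
    return R_IMPORT_REGEX.findall(source)

print(python_imports("import numpy as np\nfrom pandas import DataFrame"))  # ['numpy', 'pandas']
print(r_imports('library(ggplot2)\nrequire("dplyr")'))  # ['ggplot2', 'dplyr']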
This dataset was created on 06-06-2025. Since the computation required for this process is very resource-intensive and cannot be run in a Kaggle kernel, the dataset is not on an update schedule. A notebook demonstrating how to create this dataset and what insights it provides can be found here.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
{{language}}_programms_{{split}}.tfrecord: programs for unsupervised pretraining for the java and python languages, divided into train, valid, and test splits.
Keys: code (the source code) and language (the language name).
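A minimal reading sketch, assuming both keys are stored as byte-string features (the exact feature spec is an assumption of this illustration):

import tensorflow as tf

feature_spec = {
    "code": tf.io.FixedLenFeature([], tf.string),      # source code
    "language": tf.io.FixedLenFeature([], tf.string),  # language name
}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

ds = tf.data.TFRecordDataset("python_programms_train.tfrecord").map(parse)
for example in ds.take(1):
    print(example["language"].numpy().decode(), len(example["code"].numpy()))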
Downloaded both Python and Debian packages for offline use. The creation and usage are described in https://www.kaggle.com/code/jirkaborovec/pip-pkg-pyvips-download-4-offline
# list the bundled package files
!ls /kaggle/input/pyvips-python-and-deb-package
# install the Debian packages
!dpkg -i --force-depends /kaggle/input/pyvips-python-and-deb-package/linux_packages/archives/*.deb
# install the Python wrapper
!pip install pyvips -f /kaggle/input/pyvips-python-and-deb-package/python_packages/ --no-index
# confirm the package is visible to pip
!pip list | grep pyvips
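A quick sanity check after the install (a minimal sketch; it just exercises the libvips binding end to end):

import pyvips

img = pyvips.Image.black(16, 16)  # creates a tiny all-black image via libvips
print(pyvips.__version__, img.width, img.height)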
This dataset was created by Pawan Kumar
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is joint work with Gang Wang.
smoker.csv is a simple simulated dataset on treatment results for patients who may or may not be smokers.
groupon.csv is a dataset of Groupon deals collected by Gang and used in his research paper.
employment.csv is adapted from the dataset in Card and Krueger (1994), which estimates the causal effect of an increase in the state minimum wage on employment.
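A minimal loading sketch (the input path below is hypothetical; adjust it to wherever the dataset is mounted):

import pandas as pd

base = "/kaggle/input/<dataset-slug>"  # hypothetical path, not part of the dataset
smoker = pd.read_csv(f"{base}/smoker.csv")
groupon = pd.read_csv(f"{base}/groupon.csv")
employment = pd.read_csv(f"{base}/employment.csv")
print(smoker.head())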
This dataset was created by Oscar Wang
This dataset was created by HyeongChan Kim
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Anubhav Kumar Gupta
Released under Apache 2.0
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Machine learning tasks on source code have become increasingly important with the growing demand for software production. Our main goal with this dataset is to support bug detection and repair.
The dataset is based on the CodeNet project and contains Python code submissions to online coding competitions. The data was obtained by selecting consecutive attempts by a single user that resulted in fixing a buggy submission. The data is thus represented as code pairs, annotated with the diff and the error of each changed instruction. We have already tokenized all the source code files and kept the same format as in the original dataset.
CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
Our goal is to create a bug detection and repair pipeline for online coding competition problems.
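As a minimal illustration of the kind of pairing the dataset encodes (my own example, not taken from the data): a buggy submission, its fixed successor, and the line diff between them.

import difflib

buggy = "n = int(input())\nprint(n / 2)\n"   # float division: wrong answer
fixed = "n = int(input())\nprint(n // 2)\n"  # integer division: accepted

for line in difflib.unified_diff(buggy.splitlines(), fixed.splitlines(), lineterm=""):
    print(line)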
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Attention! Potential Scraped Python FAQs Inside
This document holds a compilation of frequently asked questions (FAQs) regarding Python, presumably gathered from the authoritative source for all things Python: the official website, python.org. However, a word of caution:
Beware of the Scrape!
Since this collection stems from a scraping process, there's a chance the information might not be current or might lack the context needed to be fully understood. For the most dependable and comprehensive details about Python, it's always recommended to consult the official Python documentation, which is meticulously maintained and kept up to date.
But what if this snippet of scraped FAQs sparks your curiosity?
Well, fret not! This collection can serve as a springboard for further exploration. Look through the questions, and if any pique your interest, use them as stepping stones to delve deeper into the official Python resources.
Here are some ways to leverage these FAQs effectively:
Identify areas you'd like to learn more about: If a specific question resonates with you, head over to the official Python documentation and search for that exact topic or its close equivalent.
Gauge your existing Python knowledge: Review the FAQs and see how many you can answer comfortably. This can help you assess your current understanding of Python.
Form a foundation for further learning: These FAQs, although potentially outdated, can provide a basic framework of Python concepts. Use them as a starting point to build your knowledge with the help of the official documentation and other reliable Python learning resources.
Remember, while scraped data can be a handy starting point, official sources are the gold standard for accurate and up-to-date information. So, use this collection with a critical eye and let it be a springboard for your Pythonic journey!
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Allan Kirwa
Released under Apache 2.0
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Mohan Pradhan
Released under Apache 2.0
This dataset was created by Monis Ahmad
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains implementations of different use cases in the machine learning life cycle, from data extraction through deployment. There are paper implementations from scratch, as well as examples of file handling, model conversion, web scraping, deployment using APIs, and more.
This dataset was created by Valde Junior
This dataset was created by Kolluri Nithin