14 datasets found
  1. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Aug 21, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (153,859,818,696 bytes)
    Dataset updated
    Aug 21, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle, which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between the Kaggle and Stack Overflow communities, and more.

    The best part is that Meta Kaggle enriches Meta Kaggle Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
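
    As a rough illustration of this join (a sketch, not part of the official documentation; the /kaggle/input paths, the example file path, and the Id column name are assumptions about the usual Meta Kaggle layout):

    import pandas as pd
    from pathlib import Path

    kernel_versions = pd.read_csv("/kaggle/input/meta-kaggle/KernelVersions.csv")

    code_file = Path("/kaggle/input/meta-kaggle-code/123/456/123456789.py")  # hypothetical file
    version_id = int(code_file.stem)                           # the file name is the KernelVersions Id
    metadata = kernel_versions[kernel_versions["Id"] == version_id]  # commit-session metadata for this version
    print(metadata)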

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will hold many fewer than 1 thousand files because private and interactive sessions are excluded.
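
    A minimal sketch of the folder arithmetic described above (any zero-padding of folder names is an assumption; check against the actual listing):

    def version_folder(version_id: int) -> str:
        top = version_id // 1_000_000        # e.g. 123 for ids 123,000,000-123,999,999
        sub = (version_id // 1_000) % 1_000  # e.g. 456 for ids 123,456,000-123,456,999
        return f"{top}/{sub}"

    assert version_folder(123_456_789) == "123/456"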

    The .ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of notebooks with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket, which means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  2. Data from: DistilKaggle: a distilled dataset of Kaggle Jupyter notebooks

    • zenodo.org
    application/gzip, bin +1
    Updated Jan 27, 2024
    Cite
    Mojtaba Mostafavi Ghahfarokhi; Arash Asgari; Mohammad Abolnejadian; Abbas Heydarnoori (2024). DistilKaggle: a distilled dataset of Kaggle Jupyter notebooks [Dataset]. http://doi.org/10.5281/zenodo.10317389
    Explore at:
    bin, csv, application/gzip
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mojtaba Mostafavi Ghahfarokhi; Arash Asgari; Mohammad Abolnejadian; Abbas Heydarnoori
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    DistilKaggle is a curated dataset extracted from Kaggle Jupyter notebooks spanning September 2015 to October 2023. It is a distilled version derived from downloading over 300GB of Kaggle kernels, focusing on the data essential for research purposes. The dataset exclusively comprises publicly available Python Jupyter notebooks from Kaggle. The information needed to retrieve and download the notebooks was obtained from the Meta Kaggle dataset provided by Kaggle.

    Contents

    The DistilKaggle dataset consists of three main CSV files:

    code.csv: Contains over 12 million rows of code cells extracted from the Kaggle kernels. Each row is identified by the kernel's ID and cell index for reproducibility.

    markdown.csv: Includes over 5 million rows of markdown cells extracted from Kaggle kernels. Similar to code.csv, each row is identified by the kernel's ID and cell index.

    notebook_metrics.csv: This file provides notebook features described in the accompanying paper released with this dataset. It includes metrics for over 517,000 Python notebooks.
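
    A small sketch of how the three files might be combined (the column names kernel_id, cell_index, and source are assumptions inferred from the description above, not documented names):

    import pandas as pd

    code = pd.read_csv("code.csv")                  # ~12M code cells
    metrics = pd.read_csv("notebook_metrics.csv")   # one row of metrics per notebook

    # Reassemble one notebook's code in cell order (the kernel id is a hypothetical example).
    some_kernel = code["kernel_id"].iloc[0]
    cells = code[code["kernel_id"] == some_kernel].sort_values("cell_index")["source"]
    notebook_code = "\n\n".join(cells.astype(str))

    # Attach notebook-level metrics to every cell for joint analysis.
    enriched = code.merge(metrics, on="kernel_id", how="left")
    print(len(notebook_code), enriched.shape)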

    Directory Structure

    The kernels directory is organized based on Kaggle's Performance Tiers (PTs), a ranking system Kaggle uses to classify users. The structure includes PT-specific directories, each containing the user ids belonging to that PT, download logs, and the data needed for downloading the notebooks.

    The utility directory contains two important files:

    aggregate_data.py: A Python script for aggregating data from different PTs into the mentioned CSV files.

    application.ipynb: A Jupyter notebook serving as a simple example application using the metrics dataframe. It demonstrates predicting the PT of the author based on notebook metrics.

    DistilKaggle.tar.gz: the compressed version of the whole dataset. If you have already downloaded all of the other files independently, there is no need to download this file.

    Usage

    Researchers can leverage this distilled dataset for various analyses without dealing with the bulk of the original 300GB dataset. For access to the raw, unprocessed Kaggle kernels, researchers can request the dataset directly.

    Note

    The original dataset of Kaggle kernels is substantial, exceeding 300GB, making it impractical for direct upload to Zenodo. Researchers interested in the full dataset can contact the dataset maintainers for access.

    Citation

    If you use this dataset in your research, please cite the accompanying paper or provide appropriate acknowledgment as outlined in the documentation.

    If you have any questions regarding the dataset, don't hesitate to contact me at mohammad.abolnejadian@gmail.com

    Thank you for using DistilKaggle!

  3. Titanic Dataset

    • kaggle.com
    Updated Apr 25, 2025
    Cite
    Muhammad Mudasar Sabir (2025). Titanic Dataset [Dataset]. https://www.kaggle.com/datasets/mudasarsabir/titanic-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Muhammad Mudasar Sabir
    Description

    👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place. This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.

    If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle

    The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

    Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the "Join Competition" button to create an account and gain access to the competition data. Then check out Alexis Cook’s Titanic Tutorial that walks you through step by step how to make your first submission!

    The Challenge
    The sinking of the Titanic is one of the most infamous shipwrecks in history.

    On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

    While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

    In this challenge, we ask you to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

    Recommended Tutorial
    We highly recommend Alexis Cook’s Titanic Tutorial, which walks you through making your very first submission step by step, and this starter notebook to get started.

    How Kaggle’s Competitions Work
    • Join the Competition: read about the challenge description, accept the Competition Rules, and gain access to the competition dataset.
    • Get to Work: download the data, build models on it locally or on Kaggle Notebooks (our no-setup, customizable Jupyter Notebooks environment with free GPUs), and generate a prediction file.
    • Make a Submission: upload your prediction as a submission on Kaggle and receive an accuracy score.
    • Check the Leaderboard: see how your model ranks against other Kagglers on our leaderboard.
    • Improve Your Score: check out the discussion forum to find lots of tutorials and insights from other competitors.

    Kaggle Lingo Video: You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. Rachael Tatman’s video on Kaggle Lingo to get up to speed!

    What Data Will I Use in This Competition? In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.

    train.csv will contain the details of a subset of the passengers on board (891, to be exact) and, importantly, will reveal whether they survived or not, also known as the “ground truth”.

    The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.

    Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.

    Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.

    How to Submit your Prediction to Kaggle
    Once you’re ready to make a submission and get on the leaderboard:

    Click on the “Submit Predictions” button

    Upload a CSV file in the submission file format. You’re able to submit 10 submissions a day.

    Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

    The file should have exactly 2 columns:

    • PassengerId (sorted in any order)
    • Survived (contains your binary predictions: 1 for survived, 0 for deceased)

    Got it! I’m ready to get started. Where do I get help if I need it?
    For Competition Help: Titanic Discussion Forum. Kaggle doesn’t have a dedicated team to help troubleshoot your code, so you’ll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!
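
    For illustration, here is a minimal sketch (not part of the competition description) that produces a file matching the submission format above; the gender-based rule is only a stand-in for a real model:

    import pandas as pd

    test = pd.read_csv("test.csv")
    submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": (test["Sex"] == "female").astype(int),  # toy baseline: predict survival for female passengers
    })
    assert len(submission) == 418 and list(submission.columns) == ["PassengerId", "Survived"]
    submission.to_csv("submission.csv", index=False)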

    A Last Word on Kaggle Notebooks
    As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository ...

  4. Lifelines python library

    • kaggle.com
    Updated Jan 6, 2025
    Cite
    Dipankar Mitra (2025). Lifelines python library [Dataset]. https://www.kaggle.com/datasets/dipankarthekohda/lifelines-python-library/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dipankar Mitra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This is a downloaded copy of the lifelines package (and its dependencies) for offline installation in your notebook, which may be required for competitions with internet access disabled.

    Just add the dataset to your notebook and run: !pip install --no-index --find-links=/kaggle/input/lifelines-python-library/lifelines_and_dependencies lifelines autograd-gamma instead of pip install lifelines, which requires internet access.

    Don't forget to leave an upvote!!👋

  5. TinyStories

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). TinyStories [Dataset]. https://www.kaggle.com/datasets/thedevastator/tinystories-narrative-classification
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    TinyStories

    A Diverse, Richly Annotated Corpus of Short-Form Stories

    By Huggingface Hub [source]

    About this dataset

    This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it is populated with an array of diverse styles and genres from multiple sources. The corpus is enriched by intricate annotations across each narrative, making it a valuable resource for narrative text classification. The text field in each row includes the entirety of each story, which can be used to identify plots, characters, and other features associated with storytelling techniques. Through this collection of stories, users will gain extensive insight into a wide range of narratives that could be used to build powerful machine learning models for narrative text classification.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following:
    • text: the story text itself (string)
    • validation.csv: a set of short stories for validation
    • train.csv: the short stories used for training narrative text classification models

    The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.

    To get started with this dataset, begin by downloading both the validation and train CSV files from the Kaggle datasets page and saving them to your computer or local environment. Once downloaded, you may need to preprocess both files by cleaning up unnecessary or wrongly formatted values and any duplicate entries before proceeding with your research or machine learning experiments, as these issues can have a large impact on the accuracy of your results.

    The next step is simply loading the two datasets into pandas DataFrames so that they can easily be manipulated and analyzed with common Natural Language Processing (NLP) tools. This requires only a few lines of code using pandas functions such as read_csv() and concat(), depending on the kind of analysis you intend to run, whether in a Jupyter notebook or with machine learning frameworks popular among data scientists such as scikit-learn.

    With everything loaded correctly, you are ready to work on your desired applications: exploring potential connections between different narratives or character traits via supervised machine learning models such as a Naive Bayes classifier, which can ultimately provide useful insights into the patterns underneath all those texts. Feel free to make interesting discoveries and predictions from the extensive analyses this richly annotated TinyStories narrative dataset enables!
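
    A minimal sketch of the loading and cleanup steps described above (the dropna/drop_duplicates choices are just one reasonable cleanup, not a prescribed recipe):

    import pandas as pd

    train = pd.read_csv("train.csv")
    validation = pd.read_csv("validation.csv")

    # Basic cleanup: drop empty or duplicated stories before any modelling.
    train = train.dropna(subset=["text"]).drop_duplicates(subset=["text"]).reset_index(drop=True)

    # Combine both splits for corpus-wide statistics.
    corpus = pd.concat([train, validation], ignore_index=True)
    print(corpus["text"].str.split().str.len().describe())  # rough story-length distribution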

    Research Ideas

    • Creating a text classification algorithm to automatically categorize short stories by genre.
    • Developing an AI-based summarization tool to quickly summarize the main points in a story.
    • Developing an AI-based story generator that can generate new stories based on existing ones in the dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv
    | Column name | Description |
    |:------------|:--------------------------------|
    | text | The text of the story. (String) |

    File: train.csv
    | Column name | Description |
    |:------------|:--------------------------------|
    | text | The text of the story. (String) |

  6. MedMNIST: Standardized Biomedical Images

    • kaggle.com
    Updated Feb 2, 2024
    Cite
    Möbius (2024). MedMNIST: Standardized Biomedical Images [Dataset]. https://www.kaggle.com/datasets/arashnic/standardized-biomedical-images-medmnist
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Möbius
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    "'https://www.nature.com/articles/s41597-022-01721-8'">MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification https://www.nature.com/articles/s41597-022-01721-8

    A large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. The providers benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools.

    MedMNIST Landscape:

    About the MedMNIST Landscape figure: the horizontal axis denotes the base-10 logarithm of the dataset scale, and the vertical axis denotes the base-10 logarithm of the imaging resolution. Upward and downward triangles distinguish 2D datasets from 3D datasets, and the 4 different colors represent different tasks.

    Key Features


    Diverse: It covers diverse data modalities, dataset scales (from 100 to 100,000), and tasks (binary/multi-class, multi-label, and ordinal regression). It is as diverse as the VDD and MSD for fairly evaluating the generalizable performance of machine learning algorithms in different settings, while providing both 2D and 3D biomedical images.

    Standardized: Each sub-dataset is pre-processed into the same format, which requires no background knowledge for users. As an MNIST-like dataset collection to perform classification tasks on small images, it primarily focuses on the machine learning part rather than the end-to-end system. Furthermore, we provide standard train-validation-test splits for all datasets in MedMNIST, therefore algorithms could be easily compared.

    User-Friendly: The small size of 28×28 (2D) or 28×28×28 (3D) is lightweight and ideal for evaluating machine learning algorithms. We also offer a larger-size version, MedMNIST+: 64x64 (2D), 128x128 (2D), 224x224 (2D), and 64x64x64 (3D). Serving as a complement to the 28-size MedMNIST, this could be a standardized resource for developing medical foundation models. All these datasets are accessible via the same API.

    Educational: As an interdisciplinary research area, biomedical image analysis is difficult for researchers from other communities to get hands-on experience with, as it requires background knowledge in computer vision, machine learning, biomedical imaging, and clinical science. Our data, released under a Creative Commons (CC) License, is easy to use for educational purposes.

    Refer to the paper to learn more about data : https://www.nature.com/articles/s41597-022-01721-8

    Starter Code: download more data and training

    Github Page: https://github.com/MedMNIST/MedMNIST

    My Kaggle Starter Notebook: https://www.kaggle.com/code/arashnic/medmnist-download-and-use-data?scriptVersionId=161421937
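
    As a rough sketch of loading one sub-dataset without the official package (the .npz file name and array keys follow common MedMNIST conventions and are assumptions here; see the GitHub page and starter notebook above for the supported loaders):

    import numpy as np

    data = np.load("pathmnist.npz")  # hypothetical example sub-dataset file
    x_train, y_train = data["train_images"], data["train_labels"]
    x_val, y_val = data["val_images"], data["val_labels"]
    x_test, y_test = data["test_images"], data["test_labels"]

    print(x_train.shape, y_train.shape)  # e.g. (N, 28, 28, 3) and (N, 1) for a 2D dataset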

    Acknowledgements

    Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. Affiliations: Shanghai Jiao Tong University, Shanghai, China; Boston College, Chestnut Hill, MA; RWTH Aachen University, Aachen, Germany; Fudan Institute of Metabolic Diseases, Zhongshan Hospital, Fudan University, Shanghai, China; Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Harvard University, Cambridge, MA.

    License and Citation

    The code is under Apache-2.0 License.

    The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)...

  7. 📊 Meta Kaggle| Kaggle Users' Stats

    • kaggle.com
    Updated Jun 26, 2025
    Cite
    BwandoWando (2025). 📊 Meta Kaggle| Kaggle Users' Stats [Dataset]. http://doi.org/10.34740/kaggle/dsv/10595847
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BwandoWando
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description


    History

    • 03Mar2025 - When determining the last content shared, I now use the latest version of each Model, Dataset, and Notebook rather than the creation date of the very first version. I also added the reaction counts, which come from a new CSV added to the Meta Kaggle dataset; the discussion can be found here. I also added the versions created for Models, Notebooks, and Datasets to properly track users who are updating their work.
    • 04Feb2025 - Fixed the issue of ModelUpvotesGiven and ModelUpvotesReceived values being identical.

    Context

    User-aggregated stats and data derived from the official Meta Kaggle dataset.

    Note

    Expect some discrepancies between these figures and the counts shown in your profile: aside from the lag of one to two days before a new dataset version is published, some information, such as Kaggle staff upvotes and private competitions, is not included. For almost all members, though, the figures should reconcile.

    Notebook updater

    📊 (Scheduled) Meta Kaggle Users' Stats

    Image: generated with the Bing image generator.

  8. 🚴🗃️ BCN Bike Sharing Dataset - Bicing Stations

    • kaggle.com
    Updated Apr 21, 2024
    Cite
    Enric Domingo (2024). 🚴🗃️ BCN Bike Sharing Dataset - Bicing Stations [Dataset]. https://www.kaggle.com/datasets/edomingo/bicing-stations-dataset-bcn-bike-sharing/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 21, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Enric Domingo
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains 250 million rows of information from the ~500 bike stations of the Barcelona public bicycle sharing service. The data consists of time series of the electric and mechanical bicycles available approximately every 4 minutes, from March 2019 to March 2024 (the latest available CSV file; the dataset is intended to be updated as each new month's file is released). This data could inspire many different use cases, from geographical data analysis to hierarchical ML time series models or Graph Neural Networks, among others. Feel free to create a New Notebook from this page to use it and share your ideas with everyone!


    Every month's information is stored in a separate file named {year}_{month}_STATIONS.csv. The metadata for every station has been simplified and compressed into the {year}_INFO.csv files, which contain a single entry per station and day, with a separate file for every year.

    The original data has various errors. A few of them have already been corrected, but there are still some missing values, columns with wrong data types, and other minor artifacts or missing data. From time to time I may manually correct more of these.

    The data is collected from the public BCN Open Data website, which is available to everyone (some resources require creating a free account and token):
    • Stations data: https://opendata-ajuntament.barcelona.cat/data/en/dataset/estat-estacions-bicing
    • Stations info: https://opendata-ajuntament.barcelona.cat/data/en/dataset/informacio-estacions-bicing

    You can find more information in them.

    Please, consider upvoting this dataset if you find it interesting! 🤗

    Some observations:
    The historical data for June '19 does not have data for the 20th between 7:40 am and 2:00 pm.
    The historical data for July '19 does not have data from the 26th at 1:30 pm until the 29th at 10:40 am.
    The historical data for November '19 may not have some data from 10:00 pm on the 26th to 11:00 am on the 27th.
    The historical data for August '20 does not have data from the 7th at 2:25 am until the 10th at 10:40 am.
    The historical data for November '20 does not have data on the following days/times: the 4th from 1:45 am to 11:05 am; the 20th from 7:50 pm to the 21st at 10:50 am; and the 27th from 2:50 am to the 30th at 9:50 am.
    The historical data for August '23 does not have data from the 22nd to the 31st due to a technical incident.
    The historical data for September '23 does not have data from the 1st to the 5th due to a technical incident.
    The historical data for February '24 does not have data on the 5th between 12:50 pm and 1:05 pm.
    Others: due to COVID-19 measures, the Bicing service was temporarily stopped; this is reflected in the historical data.

    Field Description:

    Array of data for each station:

    station_id: Identifier of the station
    num_bikes_available: Number of available bikes
    num_bikes_available_types: Array of types of available bikes
    mechanical: Number of available mechanical bikes
    ebike: Number of available electric bikes
    num_docks_available: Number of available docks
    is_installed: The station is properly installed (0-NO,1-YES)
    is_renting: The station is providing bikes correctly
    is_returning: The station is docking bikes correctly
    last_reported: Timestamp of the station information
    is_charging_station: The station has electric bike charging capacity
    status: Status of the station (IN_SERVICE=In service, CLOSED=Closed)
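
    As an illustration of the file naming and fields described above (a sketch only; the zero-padded month in the file name and the flattened column layout are assumptions to verify against the actual files):

    import pandas as pd

    year, month = 2023, 7
    stations = pd.read_csv(f"{year}_{month:02d}_STATIONS.csv")  # one month of station readings
    # Station metadata lives in yearly files, e.g. f"{year}_INFO.csv", with one entry per station and day.

    in_service = stations[(stations["is_installed"] == 1) & (stations["status"] == "IN_SERVICE")]
    availability = in_service.groupby("station_id")["num_bikes_available"].mean().sort_values()
    print(availability.head())  # stations with the fewest bikes available on average this month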

  9. National Health and Nutrition Examination Survey

    • kaggle.com
    Updated Jan 12, 2023
    Cite
    The Devastator (2023). National Health and Nutrition Examination Survey [Dataset]. https://www.kaggle.com/datasets/thedevastator/national-health-and-nutrition-examination-survey
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    National Health and Nutrition Examination Survey (NHANES) Data

    Health Indicators for Different Locations

    By Centers for Disease Control and Prevention [source]

    About this dataset

    This dataset offers an in-depth look into the National Health and Nutrition Examination Survey (NHANES), which provides valuable insights on various health indicators throughout the United States. It includes important information such as the year when data was collected, the location of the survey, the data source and value, priority areas of focus, the category and topic related to the survey, break-out categories of data values, geographic location coordinates, and other key indicators. Discover patterns in mortality rates from cardiovascular disease, or analyze whether pregnant women are more likely to report poor health than those who are not expecting, with this NHANES dataset: a powerful collection for understanding personal health behaviors.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Step 1: Understand the Data Format - Before beginning to work with NHANES data, you should become familiar with the different columns in the dataset. Each column contains a specific type of information about the data such as year collected, geographic location abbreviations and descriptions, sources used for collecting data, priority areas assigned by researchers or institutions associated with understanding health trends in a given area or population group as well as indicator values related to nutrition/health.

    Step 2: Choose an Indicator - Once you understand what is included in each column and what type of values correspond to each field it is time to select which indicator(s) you would like plots or visualizations against demographic/geographical characteristics represented by NHANES data. Selecting an appropriate indicator helps narrow down your search criteria when conducting analyses of health/nutrition trends over time in different locations or amongst different demographic groups.

    Step 3: Utilize Subsets - When narrowing down your search criteria, it can be helpful to break large datasets into smaller subsets that focus on a single area or topic (for example, nutrition trends among rural communities). This lets you zoom into specific slices of the data that are relevant to your research objectives without losing the broader context of the overall dataset, which covers all available fields for all locations examined by NHANES over many years of records.

    Research Ideas

    • Creating a health calculator to help people measure their health risk. The indicator and data value fields can be used to create an algorithm that will generate a personalized label for each user's health status.
    • Developing a visual representation of the nutritional habits of different populations based on the DataSource, LocationAbbr, and PriorityArea fields from this dataset.
    • Employing machine learning to discern patterns in the data or predict potential health risks in different regions or populations by using the GeoLocation field as inputs for geographic analysis.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    Unknown License - Please check the dataset description for more information.

  10. DailyDialog (Multi-turn Dialog)

    • kaggle.com
    Updated Nov 29, 2022
    Cite
    The Devastator (2022). DailyDialog (Multi-turn Dialog) [Dataset]. https://www.kaggle.com/datasets/thedevastator/dailydialog-unlock-the-conversation-potential-in
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 29, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    DailyDialog (Multi-turn Dialog)

    Dialogues that reflect our daily communication way and cover various topics

    By Huggingface Hub [source]

    About this dataset

    Welcome to the DailyDialog dataset, your gateway to unlocking conversation potential through multi-turn dialog experiences! Our dataset consists of conversations written by humans, which serve as a more accurate reflection of our day-to-day conversations than other datasets. Additionally, we have included manually labeled communication intentions and emotion fields in our data that can be used for advancing dialog systems.

    Whether you’re a researcher looking for new approaches in dialog systems or someone simply curious about conversation dynamics from the perspective of computer science – this dataset is here to help! We invite you to explore and make use of this data for its full potential and advance the research field further.

    Our three main files (train.csv, validation.csv, test.csv) each provide key columns such as dialog, act, and emotion, enabling you to get an even deeper understanding of how effective conversations really work -- so what are you waiting for? Unlock your conversation potential today with DailyDialog!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Welcome and thank you for your interest in the DailyDialog dataset! This dataset is designed to unlock conversation potential through multi-turn dialog experiences and provide a better understanding of conversations in our day-to-day lives. Whether you are a student, researcher, or just plain curious, this guide is here to help you get started with using the DailyDialog dataset for your own research or exploration.

    The DailyDialog dataset includes three files: train.csv, validation.csv, and test.csv which all contain dialog, act and emotion fields that can be used by those who wish to evaluate existing approaches in the field of dialogue systems or perform new experiments on conversational models. All data found in this dataset is written by humans and thus contains less noise than other datasets typically seen online.

    The first step when using this data set is to familiarize yourself with the different fields found within each file:
    • Dialog – contains the conversation between two people (String).
    • Act – contains the communication intentions of both parties involved within the dialogue (String).
    • Emotion – labels any emotions expressed during a particular dialogue (String).

    Once you understand what each of these three fields means, it’s time to start exploring! You can use any programming language or software, along with statistical methods, text-analysis tools like RapidMiner, or Natural Language Processing libraries like NLTK or spaCy, to explore these fields individually or together on a more profound level. Additionally, if you are interested in machine learning tasks, there are also possibilities such as generating new conversations from the dataset (e.g., chatbots) using reinforcement learning or deep learning architectures / neural networks for natural language understanding tasks.

    All said and done, we believe that unlocking the underlying patterns embedded within real-life conversations will bring researchers in various domains and research areas (e.g., AI / ML) great success in their efforts. Have an exciting journey :)

    Research Ideas

    • Developing a conversational AI system that can replicate authentic conversations by modeling the emotion and communication intentions present in the DailyDialog dataset.
    • Creating a language-learning tool which can customize personalized dialogues based on the DailyDialog data to help foreign language learners get used to spoken dialogue.
    • Utilizing the DailyDialog data to develop an interactive chatbot with customized responses and emotions, allowing users to learn more about their conversational skills through simulated conversations

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](https://creativecommons...

  11. WikiTableQuestions (Semi-structured Tables Q&A)

    • kaggle.com
    Updated Nov 27, 2022
    Cite
    The Devastator (2022). WikiTableQuestions (Semi-structured Tables Q&A) [Dataset]. https://www.kaggle.com/datasets/thedevastator/investigation-of-semi-structured-tables-wikitabl
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 27, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Investigation of Semi-Structured Tables: WikiTableQuestions

    A Dataset of Complex Questions on Semi-Structured Wikipedia Tables

    By [source]

    About this dataset

    The WikiTableQuestions dataset poses complex questions about the contents of semi-structured Wikipedia tables. Beyond merely testing a model's knowledge retrieval capabilities, these questions require an understanding of both the natural language used and the structure of the table itself in order to provide a correct answer. This makes the dataset an excellent testing ground for AI models that aim to replicate or exceed human-level intelligence

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    In order to use the WikiTableQuestions dataset, you will need to first understand the structure of the dataset. The dataset is comprised of two types of files: questions and answers. The questions are in natural language, and are designed to test a model's ability to understand the table structure, understand the natural language question, and reason about the answer. The answers are in a list format, and provide additional information about each table that can be used to answer the questions.

    To start working with the WikiTableQuestions dataset, you will need to download both the questions and answers files. Once you have downloaded both files, you can begin working with the dataset by loading it into a pandas dataframe. From there, you can begin exploring the data and developing your own models for answering the questions.

    Happy Kaggling!

    Research Ideas

    • The WikiTableQuestions dataset can be used to train a model to answer complex questions about semi-structured Wikipedia tables.

    • The WikiTableQuestions dataset can be used to train a model to understand the structure of semi-structured Wikipedia tables.

    • The WikiTableQuestions dataset can be used to train a model to understand the natural language questions and reason about the answers

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    Files: 0.csv, 1.csv, 10.csv, 11.csv, 12.csv, 14.csv, 15.csv, 17.csv, 18.csv


  12. VGG-Sound: only cat and dog sounds

    • kaggle.com
    Updated Apr 19, 2024
    Cite
    Nikita (2024). VGG-Sound: only cat and dog sounds [Dataset]. https://www.kaggle.com/datasets/kitonbass/vgg-sound-only-cat-and-dog-sounds
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 19, 2024
    Dataset provided by
    Kaggle
    Authors
    Nikita
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This is the part of the VGGSound dataset related to cats and dogs, with everything converted to 10-second 16kHz mono wav files. I made it for my university research, because the original dataset is kind of huge :)

    There are also two csv files with the train/test split collected from the VGG Sound splits. All data is numbered according to the indexes of the original csv tables.

    Each line in the csv files has the following columns: index in the original VGGSound (my addition), YouTube ID, start seconds, label, and train/test split.

    Also, some of the video links in the tables (~800 of them) lead to unavailable videos (age restricted/deleted/etc.), which were not downloaded and therefore are not included here – so there will be no audio for some indexes.

    The example of real practice use of the dataset can be found in my VQ-VAE 2 notebook 👨‍💻. I've also got this helper notebook 🐱🐶 which shows some simple actions you can do with audio data, in particular:
    • Some of the audio files are 9 seconds long – how to pad them
    • How to prepare spectrograms to use them as regular pictures
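
    A minimal sketch of those two actions (librosa is just one option, and the clip name is hypothetical):

    import numpy as np
    import librosa

    sr = 16_000
    y, _ = librosa.load("dog_00001.wav", sr=sr)  # hypothetical clip from this dataset

    # Some clips are ~9 seconds long: right-pad with silence up to exactly 10 s.
    target_len = 10 * sr
    y = np.pad(y, (0, max(0, target_len - len(y))))[:target_len]

    # Log-mel spectrogram, usable as a regular 2-D "picture".
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    print(log_mel.shape)  # (64, time_frames)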

    And umm, I could not figure out how to do a proper citation, but here it is from the original VGGSound:

    @InProceedings{Chen20,
      author    = "Honglie Chen and Weidi Xie and Andrea Vedaldi and Andrew Zisserman",
      title     = "VGGSound: A Large-scale Audio-Visual Dataset",
      booktitle = "International Conference on Acoustics, Speech, and Signal Processing (ICASSP)",
      year      = "2020",
    }
    
  13. 3-D Anthropometry Measurements of Human Body

    • kaggle.com
    Updated Jan 12, 2023
    Cite
    The Devastator (2023). 3-D Anthropometry Measurements of Human Body [Dataset]. https://www.kaggle.com/datasets/thedevastator/3-d-anthropometry-measurements-of-human-body-sur
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    3-D Anthropometry Measurements of Human Body Surface

    A Novel Tool for Computer-Aided Design

    By Andy R. Terrel [source]

    About this dataset

    This survey utilizes the cutting-edge three-dimensional (3-D) surface anthropometry technology, which measures the outermost surface of the human body. These technologies are a breakthrough in measuring capabilities, as they can accurately record hundreds of thousands of points in three dimensions in only a few seconds. With this data, designers and engineers are able to use computer-aided design tools and rapid prototyping in conjunction with more realistic postures to create better designs for their target audience more effectively.

    Surface anthropometry has many advantages over traditional measuring methods like rulers and tape measures: it helps reduce guesswork through its accuracy; it allows measurements to be taken long after a subject has left; it provides an efficient way to capture individuals while wearing clothing, equipment or any other accessories; each measurement is comparable with those collected by other groups regardless of who took them; and lastly, the system is non-contact so there’s no risk for discrepancies between different measurers.

    Our survey looks at 3-D body measurements: demographics such as age, gender, and reported height and weight, as well as individual body measurements such as waist circumference, preferred bra and cup size, ankle, scye, and chest circumferences, hip height, elbow and arm lengths, sleeve/inseam measurements, biacromial breadth, bicristal breadth, bust and cervical heights, chest height, interscye distance, acromion height, acromion-radiale length, axilla height, elbow height, knee height, hand length, and neck circumference. These measurements are taken from the caesar.csv file in our dataset; please make sure you provide us with all the necessary information. Thank you.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset is provided to help researchers, designers, engineers and other professionals in related fields use 3-D surface anthropometry technology to effectively measure the outer surface of the human body.

    Using this dataset can enable you to capture hundreds of thousands of points in three-dimensions on the human body surface. This data provides insights into sizing, fitting and proportions of a range of different body shapes and sizes which can be incredibly useful for many purposes like fashion design or biomedical research.

    To get started with this dataset, it is helpful to become familiar with some basic terminology such as biacromial breadth (the distance between the furthest points on the left and right shoulders), bicristal breadth (waist width measurement), knee height (the vertical distance from the hip joint center to the kneecap), ankle circumference (measurement taken at the ankle joint), etc. Knowing these measurements can help you better interpret and utilize the data provided in this survey.

    Next, you’ll want to familiarise yourself with the measurements given for each column in this dataset, including: age (Integer), num_children (Integer), gender (String), reported_height (Float), reported_weight (Float), and more. Once ready, dive into the data by loading it into your chosen analysis tool; popular options include KNIME or RStudio. You’ll be able to explore correlations between size and shape metrics, as well as discover patterns between participants based on gender, age, etc. Spend some time getting comfortable playing around with your chosen system and just keep exploring interesting connections! Finally, if you have a specific use case in mind, don't forget that user-defined variables are also possible, so create variables when needed. Thanks so much for taking part in our survey, and we wish you the best of luck analyzing the data; we hope it's useful!
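
    As one small example of working with these columns (a sketch only; the file name follows the description above, and the assumption that reported_height is in centimetres and reported_weight in kilograms should be checked against the survey documentation):

    import pandas as pd

    df = pd.read_csv("caesar.csv")                        # file name as referenced in the description
    height_m = df["reported_height"] / 100.0              # assumes centimetres
    df["bmi"] = df["reported_weight"] / height_m.pow(2)   # assumes kilograms

    print(df.groupby("gender")["bmi"].describe())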

    Research Ideas

    • Developing web-based applications or online platforms for measuring body dimensions using 3D technology for custom clothing and equipment.
    • Establishing anthropometric databases, allowing user to easily find measurements of all kinds of body shapes and sizes;
    • Analyzing patterns between anthropometric measurements and clinical data such as BMI (body mass index) to benefit the understanding of human health status and nutrition needs

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **License: [Dataset copyright by authors](http...

  14. Retinal Disease Classification

    • kaggle.com
    Updated Aug 16, 2021
    Cite
    Larxel (2021). Retinal Disease Classification [Dataset]. https://www.kaggle.com/datasets/andrewmvd/retinal-disease-classification/code?resource=download
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Kaggle
    Authors
    Larxel
    Description

    About this dataset

    According to the WHO World Report on Vision 2019, the number of visually impaired people worldwide is estimated to be 2.2 billion, of whom at least 1 billion have a vision impairment that could have been prevented or is yet to be addressed. The world faces considerable challenges in terms of eye care, including inequalities in the coverage and quality of prevention, treatment, and rehabilitation services. Early detection and diagnosis of ocular pathologies would help forestall visual impairment.

    For this purpose, we have created a new Retinal Fundus Multi-disease Image Dataset (RFMiD) consisting of a total of 3200 fundus images captured using three different fundus cameras with 46 conditions annotated through adjudicated consensus of two senior retinal experts.

    How to use this dataset

    • Create a multi-disease classification model
    • Create a model to classify between healthy and unhealthy retinas

    Highlighted Notebooks

    Acknowledgements

    If you use this dataset in your research, please credit the authors

    Citation

    Samiksha Pachade, Prasanna Porwal, Dhanshree Thulkar, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, Luca Giancardo, Gwenolé Quellec, and Fabrice Mériaudeau, 2021. Retinal Fundus Multi-Disease Image Dataset (RFMiD): A Dataset for Multi-Disease Detection Research. Data, 6(2), p.14. Available (Open Access): https://www.mdpi.com/2306-5729/6/2/14

    License

    License was not specified, yet a citation was requested whenever the data is used.

    Splash banner

    Icon by Eucalyp on FlatIcon.

