100+ datasets found
  1. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jul 10, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip (148301844275 bytes); available download format: zip
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0-licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
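    As a sketch of that join, assuming only that the code file names match the Id column of KernelVersions.csv (the sample columns below are illustrative, not the real schema):

```python
import csv
import io

# Map each KernelVersions Id to its metadata row; Meta Kaggle Code file
# names match these Ids, so this dictionary is the join.
def index_kernel_versions(csv_file):
    return {row["Id"]: row for row in csv.DictReader(csv_file)}

# Tiny illustrative stand-in for KernelVersions.csv (real columns differ):
sample = io.StringIO("Id,TotalVotes\n123456789,42\n")
versions = index_kernel_versions(sample)
# A code file named 123456789.py joins to versions["123456789"].
```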

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
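    The layout above can be sketched as a path computation; the unpadded folder names and the .py extension here are assumptions for illustration:

```python
# Map a KernelVersions id to its folder path in the two-level layout:
# top folder = the millions part of the id, sub folder = the thousands part.
def code_path(kernel_version_id, extension=".py"):
    top = kernel_version_id // 1_000_000        # e.g. 123 for 123,456,789
    sub = (kernel_version_id // 1_000) % 1_000  # e.g. 456 for 123,456,789
    return f"{top}/{sub}/{kernel_version_id}{extension}"
```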

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  2. Basic R for Data Analysis

    • kaggle.com
    Updated Dec 8, 2024
    Cite
    Kebba Ndure (2024). Basic R for Data Analysis [Dataset]. https://www.kaggle.com/datasets/kebbandure/basic-r-for-data-analysis/data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kebba Ndure
    Description

    ABOUT DATASET

    This is an R Markdown notebook. It contains a step-by-step guide for working on data analysis with R. It helps you install the relevant packages and shows how to load them. It also provides a detailed summary of the "dplyr" commands that you can use to manipulate your data in the R environment.

    Anyone new to R who wishes to carry out some data analysis in R can check it out!

  3. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions on the itemsets a customer is most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow the business and to offer customers itemset suggestions, so we will be able to increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association Rules are most often used when you are planning to find associations between different objects in a set, i.e., frequent patterns in a transaction database. They can tell you which items customers frequently buy together, which allows the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought computer mouse => bought mouse mat:

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(mouse) = 0.08/0.10 = 0.8
    • lift = confidence / P(mat) = 0.8/0.09 ≈ 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
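    The worked example can be reproduced directly; a minimal sketch (pure Python, not the R tooling used later in this write-up):

```python
# Hand-rolled reconstruction of the example: 100 customers,
# 10 bought a mouse, 9 bought a mat, 8 bought both.
transactions = (
    [{"mouse", "mat"}] * 8   # bought both
    + [{"mouse"}] * 2        # mouse only (10 mouse buyers in total)
    + [{"mat"}] * 1          # mat only (9 mat buyers in total)
    + [set()] * 89           # bought neither
)

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)
```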

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule
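    The transformation and rule-mining steps above can be sketched in miniature; this is a hand-rolled frequent-pair count over hypothetical toy transactions, standing in for the arules workflow described below:

```python
from itertools import combinations
from collections import Counter

# Hypothetical toy transactions standing in for the retail data.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "sugar"},
    {"bread", "butter", "sugar"},
]

min_support = 0.5  # keep itemsets appearing in at least half the transactions

# Count every pair of items bought together.
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)
n = len(transactions)
frequent_pairs = {
    pair: count / n for pair, count in pair_counts.items() if count / n >= min_support
}
# ("bread", "butter") appears in 3 of 4 transactions -> support 0.75
```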

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

    Libraries in R

    First, we need to load the required libraries. Below, I briefly describe each one.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

    Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
    Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

    Next, we clean our data frame by removing missing values.

    Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

    To apply Association Rule mining, we need to convert the dataframe into transaction data, so that all items bought together in one invoice will be in ...

  4. Road-R Dataset

    • kaggle.com
    Updated Aug 17, 2023
    Cite
    sciencestoked (2023). Road-R Dataset [Dataset]. https://www.kaggle.com/datasets/sciencestoked/road-r-dataset/suggestions
    Explore at:
    Croissant
    Dataset updated
    Aug 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sciencestoked
    Description

    Dataset

    This dataset was created by sciencestoked

    Contents

  5. Reddit Conversations

    • kaggle.com
    Updated Mar 4, 2020
    Cite
    Jerry Qu (2020). Reddit Conversations [Dataset]. https://www.kaggle.com/jerryqu/reddit-conversations/kernels
    Explore at:
    Croissant
    Dataset updated
    Mar 4, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jerry Qu
    Description

    Context

    I've been looking for an open-domain conversational dataset for training chatbots. I was inspired by the work done by Google Brain in 'Towards a Human-like Open-Domain Chatbot'. While Transformers/BERT are trained on all of Wikipedia, chatbots need a dataset based on conversations.

    Content

    This data came from Reddit posts/comments under the r/CasualConversation subreddit. The conversations under this subreddit were significantly more 'conversation-like' when compared to other subreddits (Ex. r/AskReddit). I'm currently looking for other subreddits to scrape.

    This dataset consists of 3 columns, where each row is a Length-3 conversation. For example:

    0 - What kind of phone(s) do you guys have?
    1 - I have a pixel. It's pretty great. Much better than what I had before.
    2 - Does it really charge all the way in 15 min?
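    A sketch of how a three-column row like this can be turned into (context, response) pairs for chatbot training; the separator token is an assumption, not part of the dataset:

```python
# Turn a length-3 conversation row into (context, response) training pairs.
def to_training_pairs(row):
    """row is a length-3 conversation: [turn0, turn1, turn2]."""
    pairs = []
    for i in range(1, len(row)):
        context = " </s> ".join(row[:i])  # join earlier turns with a separator
        pairs.append((context, row[i]))
    return pairs

row = [
    "What kind of phone(s) do you guys have?",
    "I have a pixel. It's pretty great. Much better than what I had before.",
    "Does it really charge all the way in 15 min?",
]
```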

    This data was collected between 2016-12-29 and 2019-12-31.

    Furthermore, I have the full comment trees (stored as Python dictionaries), which was an intermediary step to creating this dataset. I plan to add more data in the future. (Ex. Longer sequence lengths, other subreddits)

    Acknowledgements / License

    Data was collected using Pushshift's API. https://pushshift.io/

    I'm currently unsure about licensing. Reddit does not appear to state a clear licensing agreement, and Pushshift does not specify one either.

    Inspiration

    1. Create an open-domain chatbot (Ex. Meena)
    2. I'd love to see how you can represent types of conversations and cluster them. This would be monumentally helpful in collecting more data. (Ex. AskReddit conversations don't resemble typical person-to-person conversations. How would you identify person-to-person-esque conversations? Perhaps cosine similarity between word embeddings? Or sentence embeddings of POS tags may be very interesting.)
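    The cosine-similarity idea above is simple to sketch; the embedding vectors themselves are assumed given (from any word or sentence embedding model):

```python
import math

# Cosine similarity between two embedding vectors: the dot product
# divided by the product of the vector norms.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```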
  6. Black Jack - Interactive Card Game

    • kaggle.com
    Updated Dec 21, 2024
    Cite
    Patrick L Ford (2024). Black Jack - Interactive Card Game [Dataset]. http://doi.org/10.34740/kaggle/dsv/10262142
    Explore at:
    Croissant
    Dataset updated
    Dec 21, 2024
    Dataset provided by
    Kaggle
    Authors
    Patrick L Ford
    License

    Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)

    Description

    Introduction

    Blackjack, also known as 21, is one of the most popular card games worldwide. Blackjack remains a favourite due to its mix of simplicity, luck, strategy, and fast-paced gameplay, making it a staple in casinos.

    Objective of Blackjack:

    • The goal of Blackjack is to have a hand value closer to 21 than the dealer's hand, without exceeding 21. If a player's hand exceeds 21, they "bust" and lose the round.

    Card Values:

    • Number cards (2-10): These are worth their face value.
    • Face cards (Jack, Queen, King): Each is worth 10 points.
    • Ace: Can be worth either 1 or 11, depending on which value benefits the hand more without exceeding 21.
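    The card-value rules above can be sketched as a hand-value function: aces count as 11 unless that would bust the hand, in which case they drop to 1.

```python
# Compute the best blackjack value of a hand under the rules above.
def hand_value(cards):
    """cards: list of ranks, e.g. ["A", "K", "7"]."""
    value = 0
    aces = 0
    for card in cards:
        if card == "A":
            value += 11
            aces += 1
        elif card in ("J", "Q", "K"):
            value += 10
        else:
            value += int(card)
    while value > 21 and aces:
        value -= 10  # demote an ace from 11 to 1
        aces -= 1
    return value
```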

    Setup:

    • Deck: Blackjack is typically played with one to eight standard decks of 52 cards.
    • Players: One or more players compete against the dealer. Each player is dealt a separate hand, and players do not compete against each other.
    • Table Layout: The table features spaces for player bets, cards, and chips.

    Game Play:

    • Initial Bets:
      • Players place their bets in designated areas on the table.
    • Dealing Cards:
      • Each player and the dealer receive two cards.
      • Players' cards are dealt face-up, while the dealer gets one face-up card (up card) and one face-down card (hole card).
    • Player Options:
      • Hit: Request another card to add to their hand. Players can keep hitting until they are satisfied or bust.
      • Stand: Keep the current hand and end their turn.
      • Double Down: Double the initial bet and receive exactly one more card. Commonly allowed only on the first two cards.
      • Split: If the first two cards have the same rank, the player can split them into two separate hands by placing an additional bet equal to the original. Each hand is played separately.
      • Surrender (Optional Rule): Forfeit half the bet and end the turn. This is usually allowed only on the first two cards.
      • Insurance (Optional Rule): If the dealer's up card is an Ace, players may place a side bet (half the original bet) that the dealer has Blackjack. If the dealer has Blackjack, the insurance bet pays 2:1; otherwise, the player loses the insurance bet.
    • Dealer's Turn:
      • Hit until the hand value is 17 or higher.
      • Stand on 17 or higher (including "soft 17" in some variations).
      • The dealer does not have options; actions are automatic.
    • Winning:
      • Player Wins: The player's hand value is closer to 21 than the dealer's hand, or the dealer busts.
      • Dealer Wins: Dealer's hand value is closer to 21, or the player busts.
      • Push (Tie): Both hands have the same value; the player keeps their bet.
    • Blackjack (Natural):
      • If the player's initial two cards are an Ace and a 10-point card (Jack, Queen, King, or 10), they have a "Blackjack."
      • Blackjack typically pays 3:2 (e.g., a $10 bet wins $15).
      • If both the player and the dealer have Blackjack, it's a push.
    • House Edge and Strategy:

    The casino typically has a small edge due to rules favouring the dealer (e.g., the player acts first, so they can bust before the dealer plays):

    • Basic strategy can minimise the house edge.
    • Strategy charts show the optimal play based on the player's hand and the dealer's up card.
    • Advanced players use card counting to track high-value cards remaining in the deck, gaining an advantage.

    Common Variations:

    • European Blackjack: Dealer receives only one card initially; no hole card until players complete their turns.
    • Spanish 21: Played with 48-card decks (no 10's), with bonuses for certain hands.
    • Pontoon: A British variation where "Five Card Trick" (five cards totalling 21 or less) is a winning hand.
    • Blackjack Switch: Players play two hands and can swap the second card between them.

    Etiquette and Tips:

    • Use hand signals to indicate actions (e.g., tapping for "hit," waving for "stand").
    • Avoid touching chips after the deal starts.
    • Familiarise yourself with table-specific rules and variations.

    Visualisation

    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13231939%2Faa4b5d8819430e46c3203b3597666578%2FScreenshot%202024-12-21%2010.36.57.png?generation=1734781714095911&alt=media
    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13231939%2F86038e4d98f429825106bb2e8b5f74e8%2FScreenshot%202024-12-21%2010.38.18.png?generation=1734781738030008&alt=media
    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13231939%2F5b634959e2292840ce454745ca80062f%2FScreenshot%202024-12-21%2010.39.12.png?generation=1734781761032959&alt=media

    A Markdown document with the R code for the game of Black Jack is included.

    R Code

    The provided R code implements a simplified version of the game Blackjack. It includes f...

  7. Activity In R

    • kaggle.com
    zip
    Updated Aug 30, 2019
    Cite
    Manohar Reddy (2019). Activity In R [Dataset]. https://www.kaggle.com/datasets/manohar676/activity-in-r
    Explore at:
    zip (368 bytes); available download format: zip
    Dataset updated
    Aug 30, 2019
    Authors
    Manohar Reddy
    Description

    Dataset

    This dataset was created by Manohar Reddy

    Contents

  8. R and Python Stack Overflow Answers + Sentiment

    • kaggle.com
    Updated May 28, 2019
    Cite
    OJ Watson (2019). R and Python Stack Overflow Answers + Sentiment [Dataset]. https://www.kaggle.com/datasets/ojwatson/stack-overflow-output
    Explore at:
    Croissant
    Dataset updated
    May 28, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    OJ Watson
    Description

    Context

    This is the output of the Stack Rudeness kernel (https://www.kaggle.com/ojwatson/stack-rudeness), as saved in Cell 17.

    Content

    Stack Overflow answers by the top 10 R and Python users, extracted using BigQuery. Also includes data on whether the answer was accepted, plus some additional data based on sentiment analysis of the answer text.

    Acknowledgements

    BigQuery and StackOverflow

  9. machine-learning-Python-R-in-data-science

    • kaggle.com
    Updated Jan 1, 2020
    Cite
    Ananto Yusuf Wicaksono (2020). machine-learning-Python-R-in-data-science [Dataset]. https://www.kaggle.com/datasets/ansufw/machinelearningpythonrindatascience/code
    Explore at:
    Croissant
    Dataset updated
    Jan 1, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ananto Yusuf Wicaksono
    Description

    Dataset

    This dataset was created by Ananto Yusuf Wicaksono

    Contents

  10. Medical Cost Personal Dataset

    • kaggle.com
    Updated Jul 17, 2020
    Cite
    Abdel Homi (2020). Medical Cost Personal Dataset [Dataset]. https://www.kaggle.com/d3lhomi10/medical-cost-personal-dataset/code
    Explore at:
    Croissant
    Dataset updated
    Jul 17, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abdel Homi
    License

    Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)

    Description

    Dataset

    This dataset was created by Abdel Homi

    Released under Database: Open Database, Contents: Database Contents

    Contents

  11. Submission R File

    • kaggle.com
    Updated Jan 8, 2023
    Cite
    Seth Lanza (2023). Submission R File [Dataset]. https://www.kaggle.com/datasets/sethlanza/submission-r-file
    Explore at:
    Croissant
    Dataset updated
    Jan 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Seth Lanza
    Description

    Dataset

    This dataset was created by Seth Lanza

    Contents

  12. forum-data-r-progamming-coursera

    • kaggle.com
    zip
    Updated Sep 9, 2019
    Cite
    Kelly Xu (2019). forum-data-r-progamming-coursera [Dataset]. https://www.kaggle.com/datasets/kkellyxfq/forumdatarprogammingcoursera
    Explore at:
    zip (425061 bytes); available download format: zip
    Dataset updated
    Sep 9, 2019
    Authors
    Kelly Xu
    Description

    This file is for my postgraduate study. The data is concerned with the Coursera forum data for the R Programming course. All data has been anonymized for the purpose of data privacy.

    The data scraped is dated from September 2018 to September 2019.

  13. Using R to get data from Twitter and Binance

    • kaggle.com
    Updated Nov 3, 2019
    Cite
    Medou Neine (2019). Using R to get data from Twitter and Binance [Dataset]. https://www.kaggle.com/dodu63/using-r-to-get-data-from-twitter-and-binance/code
    Explore at:
    Croissant
    Dataset updated
    Nov 3, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Medou Neine
    Description

    Dataset

    This dataset was created by Medou Neine

    Contents

  14. Fruits-360 dataset

    • kaggle.com
    • paperswithcode.com
    • +1more
    Updated Jun 7, 2025
    Cite
    Mihai Oltean (2025). Fruits-360 dataset [Dataset]. https://www.kaggle.com/datasets/moltean/fruits
    Explore at:
    Croissant
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mihai Oltean
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

    Version: 2025.06.07.0

    Content

    The following fruits, vegetables and nuts are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).

    Branches

    The dataset has 5 major branches:

    • The 100x100 branch, where all images have 100x100 pixels. See the _fruits-360_100x100_ folder.

    • The original-size branch, where all images are at their original (captured) size. See the _fruits-360_original-size_ folder.

    • The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See the _fruits-360_dataset_meta_ folder.

    • The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See the _fruits-360_multi_ folder.

    • The _3_body_problem_ branch, where the Training and Test folders contain different varieties of 3 fruits and vegetables (Apples, Cherries and Tomatoes). See the _fruits-360_3-body-problem_ folder.

    How to cite

    Mihai Oltean, Fruits-360 dataset, 2017-

    Dataset properties

    For the 100x100 branch

    Total number of images: 138704.

    Training set size: 103993 images.

    Test set size: 34711 images.

    Number of classes: 206 (fruits, vegetables, nuts and seeds).

    Image size: 100x100 pixels.

    For the original-size branch

    Total number of images: 58363.

    Training set size: 29222 images.

    Validation set size: 14614 images

    Test set size: 14527 images.

    Number of classes: 90 (fruits, vegetables, nuts and seeds).

    Image size: various (the original captured size).

    For the 3-body-problem branch

    Total number of images: 47033.

    Training set size: 34800 images.

    Test set size: 12233 images.

    Number of classes: 3 (Apples, Cherries, Tomatoes).

    Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.

    Image size: 100x100 pixels.

    For the meta branch

    Number of classes: 26 (fruits, vegetables, nuts and seeds).

    For the multi branch

    Number of images: 150.

    Filename format:

    For the 100x100 branch

    image_index_100.jpg (e.g. 31_100.jpg) or

    r_image_index_100.jpg (e.g. r_31_100.jpg) or

    r?_image_index_100.jpg (e.g. r2_31_100.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).

    Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.
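    A sketch of parsing the 100x100-branch filename format described above (e.g. "r2_31_100.jpg") into its rotation prefix and image index:

```python
import re

# Optional rotation prefix ("r" or "r<digit>"), then the image index,
# then the fixed "_100.jpg" suffix of the 100x100 branch.
PATTERN = re.compile(r"^(?:(r\d?)_)?(\d+)_100\.jpg$")

def parse_filename(name):
    m = PATTERN.match(name)
    if m is None:
        return None
    return m.group(1), int(m.group(2))  # (rotation or None, image index)
```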

    For the original-size branch

    r?_image_index.jpg (e.g. r2_31.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.

    The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.

    For the multi branch

    The file's name is the concatenation of the names of the fruits inside that picture.

    Alternate download

    The Fruits-360 dataset can be downloaded from:

    Kaggle https://www.kaggle.com/moltean/fruits

    GitHub https://github.com/fruits-360

    How fruits were filmed

    Fruits and vegetables were mounted in the shaft of a low-speed motor (3 rpm), and a short, 20-second movie was recorded.

    A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.

    Behind the fruits, we placed a white sheet of paper as a background.

    Here i...

  15. Data from: Data Mining Using R:

    • kaggle.com
    Updated Jul 2, 2018
    Cite
    Data Science (2018). Data Mining Using R: [Dataset]. https://www.kaggle.com/ravali566/data-mining-using-r/code
    Explore at:
    Croissant
    Dataset updated
    Jul 2, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Science
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Dataset

    This dataset was created by Data Science

    Released under CC0: Public Domain

    Contents

  16. Road-R Dataset Sample

    • kaggle.com
    Updated Aug 17, 2023
    Cite
    sciencestoked (2023). Road-R Dataset Sample [Dataset]. https://www.kaggle.com/datasets/sciencestoked/road-r-dataset-sample/code
    Explore at:
    Croissant
    Dataset updated
    Aug 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sciencestoked
    Description

    Dataset

    This dataset was created by sciencestoked

    Contents

  17. May 2015 Reddit Comments

    • kaggle.com
    zip
    Updated Jun 4, 2019
    Cite
    Kaggle (2019). May 2015 Reddit Comments [Dataset]. https://www.kaggle.com/datasets/kaggle/reddit-comments-may-2015
    Explore at:
    zip(21429083286 bytes)Available download formats
    Dataset updated
    Jun 4, 2019
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    https://www.reddit.com/wiki/api

    Description

    Recently Reddit released an enormous dataset containing all ~1.7 billion of their publicly available comments. The full dataset is an unwieldy 1+ terabyte uncompressed, so we've decided to host a small portion of the comments here for Kagglers to explore. (You don't even need to leave your browser!)

    You can find all the comments from May 2015 in Scripts for your natural language processing pleasure. What had redditors laughing, bickering, and NSFW-ing this spring?

    Who knows? Top visualizations may just end up on Reddit.

    Data Description

    The database has one table, May2015, with the following fields:

    • created_utc
    • ups
    • subreddit_id
    • link_id
    • name
    • score_hidden
    • author_flair_css_class
    • author_flair_text
    • subreddit
    • id
    • removal_reason
    • gilded
    • downs
    • archived
    • author
    • score
    • retrieved_on
    • body
    • distinguished
    • edited
    • controversiality
    • parent_id
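    The schema above can be queried with standard SQL. As a minimal sketch using Python's built-in sqlite3 module, with a tiny in-memory stand-in for the May2015 table (the real dataset ships as a SQLite database; connect to its actual file path instead, and note the toy rows and reduced column set here are illustrative):

    ```python
    import sqlite3

    # Swap ":memory:" for the path to the downloaded SQLite database file.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE May2015 (subreddit TEXT, score INTEGER, body TEXT)")
    conn.executemany(
        "INSERT INTO May2015 VALUES (?, ?, ?)",
        [("askreddit", 42, "hello"), ("askreddit", 7, "world"), ("pics", 99, "wow")],
    )

    # Rank subreddits by total comment score.
    rows = conn.execute(
        "SELECT subreddit, SUM(score) AS total FROM May2015 "
        "GROUP BY subreddit ORDER BY total DESC"
    ).fetchall()
    print(rows)  # -> [('pics', 99), ('askreddit', 49)]
    ```

    The same GROUP BY pattern works against the full table, e.g. over `controversiality` or `gilded` instead of `score`.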
  18. Top 10 R and Python Stack Overflow User Answers

    • kaggle.com
    Updated May 28, 2019
    OJ Watson (2019). Top 10 R and Python Stack Overflow User Answers [Dataset]. https://www.kaggle.com/ojwatson/stack-answers-r-python/activity
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 28, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    OJ Watson
    Description

    Context

    This is the input data available for the Stack Rudeness kernel (https://www.kaggle.com/ojwatson/stack-rudeness).

    Content

    Stack Overflow answers by the top 10 R and Python users, extracted using BigQuery. Also includes data, downloaded from the Stack Overflow API, on whether each answer was accepted.

    Acknowledgements

    BigQuery and Stack Overflow

  19. igraph in R

    • kaggle.com
    Updated Oct 19, 2021
    Vashu Gupta (2021). igraph in R [Dataset]. https://www.kaggle.com/datasets/vashugupta0298/igraph-in-r
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 19, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vashu Gupta
    Description

    Dataset

    This dataset was created by Vashu Gupta

    Contents

  20. Survival Prediction with Titanic Dataset using R

    • kaggle.com
    zip
    Updated Jan 26, 2018
    Sivasuryanarayan Krishnamoorthy (2018). Survival Prediction with Titanic Dataset using R [Dataset]. https://www.kaggle.com/sivasuryak3/survival-prediction-with-titanic-dataset-using-r
    Explore at:
    Available download formats: zip (33847 bytes)
    Dataset updated
    Jan 26, 2018
    Authors
    Sivasuryanarayan Krishnamoorthy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Sivasuryanarayan Krishnamoorthy

    Released under CC0: Public Domain

    Contents

    It contains the following files:


Meta Kaggle Code

Kaggle's public data on notebook code

4 scholarly articles cite this dataset.


Sensitive data

While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

Joining with Meta Kaggle

The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
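The id-to-filename correspondence above makes the join a one-liner in pandas. A sketch with toy stand-in data (the real inputs are the Meta Kaggle Code file listing and KernelVersions.csv; the `Id` and `TotalVotes` columns are taken as representative of that CSV, and the toy values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for Meta Kaggle's KernelVersions.csv (the real file has many more columns).
kernel_versions = pd.DataFrame({
    "Id": [123456001, 123456002, 987654003],
    "TotalVotes": [5, 0, 12],
})

# File names in Meta Kaggle Code match KernelVersions ids, e.g. "123456001.py".
code_files = pd.DataFrame({"FileName": ["123456001.py", "987654003.ipynb"]})
code_files["Id"] = code_files["FileName"].str.split(".").str[0].astype(int)

# Left join: every code file keeps its row, enriched with Meta Kaggle metadata.
joined = code_files.merge(kernel_versions, on="Id", how="left")
print(joined[["FileName", "TotalVotes"]])
```

Note that ids present in KernelVersions but absent here (interactive or private sessions) simply drop out of the left join.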

File organization

The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
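The two-level folder for a given version id follows directly from that scheme (millions for the top folder, thousands for the sub-folder). A small helper, assuming folder names are plain unpadded integers as in the examples above:

```python
def code_file_dir(kernel_version_id: int) -> str:
    """Map a KernelVersions id to its two-level folder in Meta Kaggle Code.

    The top-level folder groups ids by millions; the sub-folder by thousands.
    """
    top = kernel_version_id // 1_000_000
    sub = (kernel_version_id // 1_000) % 1_000
    return f"{top}/{sub}"

# Version 123,456,789 lives under folder 123/456.
print(code_file_dir(123_456_789))  # -> 123/456
```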

The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

Questions / Comments

We love feedback! Let us know in the Discussion tab.

Happy Kaggling!
