100+ datasets found
  1. Data from: KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    Cite
    Quaranta, Luigi (2024). KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4468522
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Quaranta, Luigi
    Calefato, Fabio
    Lanubile, Filippo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    KGTorrent is a dataset of Python Jupyter notebooks from the Kaggle platform.

    The dataset is accompanied by a MySQL database containing metadata about the notebooks and the activity of Kaggle users on the platform. The information to build the MySQL database has been derived from Meta Kaggle, a publicly available dataset containing Kaggle metadata.

    In this package, we share the complete KGTorrent dataset (consisting of the dataset itself plus its companion database), as well as the specific version of Meta Kaggle used to build the database.

    More specifically, the package comprises the following three compressed archives:

    KGT_dataset.tar.bz2, the dataset of Jupyter notebooks;

    KGTorrent_dump_10-2020.sql.tar.bz2, the dump of the MySQL companion database;

    MetaKaggle27Oct2020.tar.bz2, a copy of the Meta Kaggle version used to build the database.

    Moreover, we include KGTorrent_logical_schema.pdf, the logical schema of the KGTorrent MySQL database.
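    Readers who want to work with the package locally might proceed roughly as in the sketch below, which unpacks the three archives with Python's tarfile module and restores the companion database by piping the dump into a local MySQL server. The database name, credentials, destination paths, and the exact name of the extracted .sql file are assumptions, not part of the dataset documentation.

```python
# Sketch: unpack the KGTorrent archives and restore the companion MySQL database.
# Paths, database name, and credentials are assumptions; adjust them to your setup.
import subprocess
import tarfile
from pathlib import Path

ARCHIVES = [
    "KGT_dataset.tar.bz2",                 # the Jupyter notebooks
    "KGTorrent_dump_10-2020.sql.tar.bz2",  # the MySQL dump
    "MetaKaggle27Oct2020.tar.bz2",         # the Meta Kaggle snapshot
]

def extract_all(dest: str = "kgtorrent") -> None:
    """Extract each bzip2-compressed tar archive into a common directory."""
    Path(dest).mkdir(exist_ok=True)
    for name in ARCHIVES:
        with tarfile.open(name, mode="r:bz2") as archive:
            archive.extractall(dest)

def restore_database(dump_path: str, db_name: str = "kgtorrent") -> None:
    """Pipe the extracted .sql dump into MySQL (assumes the mysql client is
    installed and the target database has already been created)."""
    with open(dump_path, "rb") as dump:
        subprocess.run(["mysql", "-u", "root", "-p", db_name], stdin=dump, check=True)

if __name__ == "__main__":
    extract_all()
    restore_database("kgtorrent/KGTorrent_dump_10-2020.sql")  # assumed extracted filename
```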

  2. issues-kaggle-notebooks

    • huggingface.co
    Updated Jul 8, 2025
    Cite
    Hugging Face Smol Models Research (2025). issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
    Explore at:
    Dataset updated
    Jul 8, 2025
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    Description

    GitHub Issues & Kaggle Notebooks


    GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for training language models; they are sourced from GitHub issues and from notebooks on the Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, specifically the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.
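    A quick way to inspect the collection is with the Hugging Face datasets library, as in the sketch below. The configuration and split names are assumptions; check the dataset page for the exact ones.

```python
# Sketch: stream a few examples from the collection without downloading it all.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceTB/issues-kaggle-notebooks",
    split="train",      # assumed split name
    streaming=True,
)

for i, example in enumerate(ds):
    print(sorted(example.keys()))  # inspect the available fields
    if i >= 2:
        break
```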

  3. Kaggle

    • bioregistry.io
    Updated Mar 18, 2022
    Cite
    (2022). Kaggle [Dataset]. http://identifiers.org/re3data:r3d100012705
    Explore at:
    Dataset updated
    Mar 18, 2022
    Description

    Kaggle is a platform for sharing data, performing reproducible analyses, working through interactive data analysis tutorials, and entering machine learning competitions.

  4. Customer360Insights

    • kaggle.com
    Updated Jun 9, 2024
    Cite
    Dave Darshan (2024). Customer360Insights [Dataset]. https://www.kaggle.com/datasets/davedarshan/customer360insights
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dave Darshan
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Customer360Insights

    The Customer360Insights dataset is a synthetic collection meticulously designed to mirror the multifaceted nature of customer interactions within an e-commerce platform. It encompasses a wide array of variables, each serving as a pillar to support various analytical explorations. Here’s a breakdown of the dataset and the potential analyses it enables:

    Dataset Description

    • Customer Demographics: Includes FullName, Gender, Age, CreditScore, and MonthlyIncome. These variables provide a demographic snapshot of the customer base, allowing for segmentation and targeted marketing analysis.
    • Geographical Data: Comprising Country, State, and City, this section facilitates location-based analytics, market penetration studies, and regional sales performance.
    • Product Information: Details like Category, Product, Cost, and Price enable product trend analysis, profitability assessment, and inventory optimization.
    • Transactional Data: Captures the customer journey through SessionStart, CartAdditionTime, OrderConfirmation, OrderConfirmationTime, PaymentMethod, and SessionEnd. This rich temporal data can be used for funnel analysis, conversion rate optimization, and customer behavior modeling.
    • Post-Purchase Details: With OrderReturn and ReturnReason, analysts can delve into return rate calculations, post-purchase satisfaction, and quality control.

    Types of Analysis

    • Descriptive Analytics: Understand basic metrics like average monthly income, most common product categories, and typical credit scores.
    • Predictive Analytics: Use machine learning to predict credit risk or the likelihood of a purchase based on demographics and session activity.
    • Customer Segmentation: Group customers by demographics or purchasing behavior to tailor marketing strategies.
    • Geospatial Analysis: Examine sales distribution across different regions and optimize logistics.
    • Time Series Analysis: Study the seasonality of purchases and session activities over time.
    • Funnel Analysis: Evaluate the customer journey from session start to order confirmation and identify drop-off points.
    • Cohort Analysis: Track customer cohorts over time to understand retention and repeat purchase patterns.
    • Market Basket Analysis: Discover product affinities and develop cross-selling strategies.

    This dataset is a playground for data enthusiasts to practice cleaning, transforming, visualizing, and modeling data. Whether you’re conducting A/B testing for marketing campaigns, forecasting sales, or building customer profiles, Customer360Insights offers a rich, realistic dataset for honing your data science skills.
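    As one concrete starting point, the sketch below computes a simple session-to-order funnel with pandas from the transactional columns described above (SessionStart, CartAdditionTime, OrderConfirmationTime). The CSV filename is an assumption; use the file shipped with the dataset.

```python
# Sketch: a basic funnel analysis over the Customer360Insights journey columns.
import pandas as pd

df = pd.read_csv("Customer360Insights.csv")  # assumed filename

# Treat a non-null timestamp as "the customer reached this stage".
stages = ["SessionStart", "CartAdditionTime", "OrderConfirmationTime"]
counts = {stage: int(df[stage].notna().sum()) for stage in stages}

total = counts["SessionStart"]
for stage in stages:
    reached = counts[stage]
    print(f"{stage:>22}: {reached:6d} ({reached / total:.1%} of sessions)")
```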

    Curious about how I created the data? Feel free to click here and take a peek! 😉

    📊🔍 Good Luck and Happy Analysing 🔍📊

  5. ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-top-1000-kaggle-datasets-658b/b992f64b/?iid=004-553&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Top 1000 Kaggle Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/notkrishna/top-1000-kaggle-datasets on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    From wiki

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard; Nicholas Gruen was the founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at $25 million. On 8 March 2017, Google announced that it was acquiring Kaggle.[1][2]

    Source: Kaggle

    --- Original source retains full ownership of the source dataset ---

  6. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +2 more
    zip
    Updated Jan 24, 2020
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.

    • The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    • All audio samples in this dataset have a single label (i.e., are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants did listen to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a relation of the audio clips included in FSDKaggle2018 and their corresponding license. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/                     Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/                      Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                            Files for evaluation setup
    │   │
    │   └───train_post_competition.csv                 Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv    Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                                  The dataset description file you are reading
        │
        └───LICENSE-DATASET
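    Once the archives are extracted, the metadata and audio can be loaded roughly as in the sketch below. The metadata column names (fname, label, manually_verified) and the use of the soundfile package are assumptions; adjust them to the actual CSV header and your preferred audio library.

```python
# Sketch: read FSDKaggle2018 metadata with pandas and load one audio clip.
import pandas as pd
import soundfile as sf  # pip install soundfile

meta = pd.read_csv("FSDKaggle2018.meta/train_post_competition.csv")
print(meta["label"].value_counts().head())        # clips per category (assumed column name)

verified = meta[meta["manually_verified"] == 1]   # assumed flag column
print(f"{len(verified)} manually-verified training clips")

# Audio is uncompressed PCM 16-bit, 44.1 kHz, mono (per the description above).
audio, sr = sf.read("FSDKaggle2018.audio_train/" + meta.loc[0, "fname"])
print(audio.shape, sr)
```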

  7. FSDKaggle2019

    • zenodo.org
    • explore.openaire.eu
    • +1 more
    bin, zip
    Updated Jan 24, 2020
    Cite
    Eduardo Fonseca; Manoj Plakal; Frederic Font; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2019 [Dataset]. http://doi.org/10.5281/zenodo.3612637
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Manoj Plakal; Frederic Font; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.

    Citation

    If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Data curators

    Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    ABOUT FSDKaggle2019

    Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from the Sound Understanding team at Google AI Perception. The competition was intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.

    FSDKaggle2019 employs audio clips from the following sources:

    1. Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology
    2. The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)

    The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.

    What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.

    Ground Truth Labels

    The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).

    The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].

    The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].

    Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:

    • curated train set: correct (but potentially incomplete) labels
    • noisy train set: noisy labels
    • test set: correct and complete labels

    Further details can be found below in the sections for each set.

    Format

    All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.

    DATA SPLIT

    FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.

    Curated train set

    The curated train set consists of manually-labeled data from FSD.

    • Number of clips/class: 75, except in a few cases (where there are fewer)
    • Total number of clips: 4970
    • Avg number of labels/clip: 1.2
    • Total duration: 10.5 hours

    The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).

    Noisy train set

    The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].

    • Number of clips/class: 300
    • Total number of clips: 19,815
    • Avg number of labels/clip: 1.2
    • Total duration: ~80 hours

    The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.

    Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.

    Test set

    The test set is used for system evaluation and consists of manually-labeled data from FSD.

    • Number of clips/class: between 50 and 150
    • Total number of clips: 4481
    • Avg number of labels/clip: 1.4
    • Total duration: 12.9 hours

    The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Barring human error, the labels are correct and complete with respect to the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content from outside the vocabulary.

    During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private).
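    Because clips can carry several tags, the labels are typically expanded into a multi-label target matrix before training. The sketch below shows one way to do this with scikit-learn; the CSV path and the column names (fname, labels) are assumptions based on the description above.

```python
# Sketch: binarize FSDKaggle2019's comma-separated clip labels for multi-label training.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

curated = pd.read_csv("FSDKaggle2019.meta/train_curated.csv")  # assumed path
label_lists = curated["labels"].str.split(",")                 # assumed column name

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_lists)  # shape: (n_clips, n_classes), n_classes should be ~80

print(Y.shape, len(mlb.classes_))
print("average labels per clip:", Y.sum(axis=1).mean())
```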

  8. Deep Lit

    • kaggle.com
    zip
    Updated Aug 11, 2019
    Cite
    Jay2K (2019). Deep Lit [Dataset]. https://www.kaggle.com/datasets/jk20191105/deep-lit
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Aug 11, 2019
    Authors
    Jay2K
    Description

    Dataset

    This dataset was created by Jay2K

    Contents

  9. Titanic_Subset

    • kaggle.com
    Updated Feb 28, 2018
    Cite
    jiuzhang (2018). Titanic_Subset [Dataset]. https://www.kaggle.com/jiuzhang/titanic-subset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2018
    Dataset provided by
    Kaggle
    Authors
    jiuzhang
    License

    CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    A subset of the Titanic dataset.

    Content

    1,313 records with 11 feature columns.

    Acknowledgements

    Kaggle platform

    Inspiration

    For spreading basic machine learning knowledge.

  10. Mentorship Platform

    • kaggle.com
    Updated Apr 8, 2025
    Cite
    Ayush Khubchandani (2025). Mentorship Platform [Dataset]. https://www.kaggle.com/datasets/ayushkhubchandani/mentorship-platform/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ayush Khubchandani
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Ayush Khubchandani

    Released under MIT

    Contents

  11. E-Commerce Retail Sales Series Data Collection

    • kaggle.com
    Updated Dec 7, 2019
    Cite
    US Census Bureau (2019). E-Commerce Retail Sales Series Data Collection [Dataset]. https://www.kaggle.com/datasets/census/e-commerce-retail-sales-series-data-collection
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 7, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    US Census Bureau
    Description

    Content

    More details about each file are in the individual file descriptions.

    Context

    This is a dataset from the U.S. Census Bureau hosted by the Federal Reserve Economic Database (FRED). FRED has its own data platform and updates its information according to the amount of data that is brought in. Explore the U.S. Census Bureau using Kaggle and all of the data sources available through the U.S. Census Bureau organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using FRED's API and Kaggle's API.
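    Since the dataset is maintained through Kaggle's API, one convenient way to fetch it is via the official Kaggle command-line tool, as in the sketch below. It assumes the kaggle CLI is installed and API credentials are configured (~/.kaggle/kaggle.json); the dataset slug is taken from the URL above.

```python
# Sketch: download and unzip the dataset with the Kaggle CLI invoked from Python.
import subprocess

SLUG = "census/e-commerce-retail-sales-series-data-collection"

subprocess.run(
    ["kaggle", "datasets", "download", "-d", SLUG, "-p", "ecommerce_retail", "--unzip"],
    check=True,
)
```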

  12. Comparison results of different model.

    • plos.figshare.com
    xls
    Updated Dec 8, 2023
    Cite
    Ke Peng; Yan Peng; Wenguang Li (2023). Comparison results of different model. [Dataset]. http://doi.org/10.1371/journal.pone.0289724.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ke Peng; Yan Peng; Wenguang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, with the continuous improvement of the financial system and the rapid development of the banking industry, competition within the banking sector has intensified. At the same time, with the rapid development of information technology and Internet technology, customers' choice of financial products is becoming more diversified, their dependence on and loyalty to banking institutions is declining, and the problem of customer churn in commercial banks is becoming more prominent. How to predict customer behavior and retain existing customers has become a major challenge for banks. Therefore, this study takes a bank's business data on the Kaggle platform as the research object, compares multiple sampling methods for balancing the data, constructs a bank customer churn prediction model based on GA-XGBoost for churn identification, and conducts an interpretability analysis of the GA-XGBoost model to provide decision support and suggestions for the banking industry to prevent customer churn.

    The results show that: (1) The applied SMOTEENN is more effective than SMOTE and ADASYN in dealing with the imbalance of the banking data. (2) The F1 and AUC values of the XGBoost model improved and optimized with a genetic algorithm reach 90% and 99%, respectively, which is optimal compared with the six other machine learning models; the GA-XGBoost classifier was identified as the best solution for the customer churn problem. (3) Using Shapley values, we explain how each feature affects the model results and analyze the features that have a high impact on the model prediction, such as the total number of transactions in the past year, the amount of transactions in the past year, the number of products owned by customers, and the total sales balance.

    The contribution of this paper is mainly in two aspects: (1) building on the accurate identification of churned customers, this study extracts useful information from the black-box model, which can serve as a reference for commercial banks seeking to improve service quality and retain customers; (2) it can serve as a reference for customer churn early-warning models in other related industries, helping the banking industry maintain customer stability, preserve market position, and reduce corporate losses.
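    The modeling pipeline described above (rebalance with SMOTEENN, fit XGBoost, explain with Shapley values) can be sketched as below. Standard hyperparameters stand in for the genetic-algorithm-tuned ones, and the CSV filename and target column are assumptions; the features are assumed to be numeric.

```python
# Sketch: SMOTEENN rebalancing + XGBoost churn model + SHAP interpretation.
import pandas as pd
import shap
from imblearn.combine import SMOTEENN
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("bank_churn.csv")                   # assumed filename
X, y = df.drop(columns=["Exited"]), df["Exited"]     # assumed (numeric) features and target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Rebalance only the training split so the test split stays untouched.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)

model = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.1, eval_metric="logloss")
model.fit(X_res, y_res)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]
print("F1:", f1_score(y_test, pred), "AUC:", roc_auc_score(y_test, proba))

# Shapley values show how each feature pushes individual predictions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```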

  13. Community-Driven Model Service Platform Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 4, 2025
    Cite
    Data Insights Market (2025). Community-Driven Model Service Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/community-driven-model-service-platform-507803
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Jun 4, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The community-driven model service platform market is experiencing robust growth, projected to reach $35.14 billion in 2025 and expanding at a compound annual growth rate (CAGR) of 10.1% from 2025 to 2033. This surge is driven by several key factors. The increasing accessibility of machine learning models, fueled by platforms like Kaggle, GitHub, and Hugging Face, is lowering the barrier to entry for developers and researchers. The collaborative nature of these platforms fosters innovation and accelerates model development, leading to a wider adoption of AI solutions across various industries. Furthermore, the growing demand for specialized and customized AI models is pushing businesses to leverage community-driven platforms, where they can find pre-trained models or collaborate on developing tailored solutions, thereby reducing development time and costs. The trend towards open-source models and the rise of model zoos contribute significantly to this market expansion. While challenges exist, such as ensuring model quality, security, and addressing potential biases, the overall market trajectory remains strongly positive.

    The market's segmentation likely includes various model types (e.g., image recognition, natural language processing, time series analysis), deployment options (cloud-based, on-premise), and target industries (healthcare, finance, retail). Leading players, such as Kaggle, GitHub, Hugging Face, TensorFlow Hub, Model Zoo, DrivenData, and Cortex, are actively shaping the market landscape through continuous innovation and community engagement. The geographical distribution of the market is likely to reflect the global concentration of AI expertise and technological infrastructure, with regions like North America and Europe holding significant market shares initially, followed by rapid expansion in Asia and other developing regions as digital infrastructure improves. Future growth will hinge on continued technological advancements, further integration with cloud platforms, and the development of robust governance frameworks to address ethical concerns surrounding AI model development and deployment.

  14. ‘Video Game Sales and Ratings’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Video Game Sales and Ratings’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-video-game-sales-and-ratings-0c41/c2aaa1eb/?iid=006-219&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Video Game Sales and Ratings’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kendallgillies/video-game-sales-and-ratings on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This data set contains a list of video games with sales greater than 100,000 copies, along with critic and user ratings. It is a combined web scrape from VGChartz and Metacritic, along with manually entered year-of-release values for most games with a missing year of release. The original coding was created by Rush Kirubi and can be found here, but it limited the data to only include a subset of video game platforms. Not all of the listed video games have information on Metacritic, so the data set does have missing values.

    Content

    The fields include:

    • Name - The game's name
    • Platform - Platform of the games release
    • Year_of_Release - Year of the game's release
    • Genre - Genre of the game
    • Publisher - Publisher of the game
    • NA_Sales - Sales in North America (in millions)
    • EU_Sales - Sales in Europe (in millions)
    • JP_Sales - Sales in Japan (in millions)
    • Other_Sales - Sales in the rest of the world (in millions)
    • Global_Sales - Total worldwide sales (in millions)
    • Critic_score - Aggregate score compiled by Metacritic staff
    • Critic_count - The number of critics used in coming up with the critic score
    • User_score - Score by Metacritic's subscribers
    • User_count - Number of users who gave the user score
    • Rating - The ESRB ratings

    Acknowledgements

    Again the main credit behind this data set goes to Rush Kirubi. I just commented out two lines of his code.

    Also the original inspiration for this data set came from Gregory Smith who originally scraped the data from VGChartz, it can be found here.

    --- Original source retains full ownership of the source dataset ---

  15. Video Game Sales

    • kaggle.com
    zip
    Updated Oct 26, 2016
    Cite
    GregorySmith (2016). Video Game Sales [Dataset]. https://www.kaggle.com/gregorut/videogamesales
    Explore at:
    Available download formats: zip (390,286 bytes)
    Dataset updated
    Oct 26, 2016
    Authors
    GregorySmith
    Description

    This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com.

    Fields include

    • Rank - Ranking of overall sales

    • Name - The game's name

    • Platform - Platform of the game's release (e.g., PC, PS4)

    • Year - Year of the game's release

    • Genre - Genre of the game

    • Publisher - Publisher of the game

    • NA_Sales - Sales in North America (in millions)

    • EU_Sales - Sales in Europe (in millions)

    • JP_Sales - Sales in Japan (in millions)

    • Other_Sales - Sales in the rest of the world (in millions)

    • Global_Sales - Total worldwide sales.

    The script to scrape the data is available at https://github.com/GregorUT/vgchartzScrape. It is based on BeautifulSoup, using Python. There are 16,598 records; two records were dropped due to incomplete information.
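    The fields above map directly onto a flat table, so a quick look with pandas is straightforward, as in the sketch below. The filename vgsales.csv is the name commonly used for this file but is an assumption here.

```python
# Sketch: aggregate the video game sales table by publisher and by genre.
import pandas as pd

df = pd.read_csv("vgsales.csv")  # assumed filename

# Top publishers by total worldwide sales (sales columns are in millions of copies).
top_publishers = (
    df.groupby("Publisher")["Global_Sales"]
      .sum()
      .sort_values(ascending=False)
      .head(10)
)
print(top_publishers)

# Regional sales broken down by genre.
print(df.groupby("Genre")[["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]].sum())
```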

  16. ‘Video Game Sales’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 20, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Video Game Sales’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-video-game-sales-30b0/092867fa/?iid=010-909&v=presentation
    Explore at:
    Dataset updated
    Nov 20, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Video Game Sales’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/gregorut/videogamesales on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com.

    Fields include

    • Rank - Ranking of overall sales

    • Name - The game's name

    • Platform - Platform of the game's release (e.g., PC, PS4)

    • Year - Year of the game's release

    • Genre - Genre of the game

    • Publisher - Publisher of the game

    • NA_Sales - Sales in North America (in millions)

    • EU_Sales - Sales in Europe (in millions)

    • JP_Sales - Sales in Japan (in millions)

    • Other_Sales - Sales in the rest of the world (in millions)

    • Global_Sales - Total worldwide sales.

    The script to scrape the data is available at https://github.com/GregorUT/vgchartzScrape. It is based on BeautifulSoup, using Python. There are 16,598 records; two records were dropped due to incomplete information.

    --- Original source retains full ownership of the source dataset ---

  17. Data characteristics for the Kaggle.com seizure forecasting contest.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Francisco Javier Muñoz-Almaraz; Francisco Zamora-Martínez; Paloma Botella-Rocamora; Juan Pardo (2023). Data characteristics for the Kaggle.com seizure forecasting contest. [Dataset]. http://doi.org/10.1371/journal.pone.0178808.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Francisco Javier Muñoz-Almaraz; Francisco Zamora-Martínez; Paloma Botella-Rocamora; Juan Pardo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Source: [9].

  18. Comparison of GA-XGBoost with XGBoost and LightGBM test results.

    • figshare.com
    xls
    Updated Dec 8, 2023
    Cite
    Ke Peng; Yan Peng; Wenguang Li (2023). Comparison of GA-XGBoost with XGBoost and LightGBM test results. [Dataset]. http://doi.org/10.1371/journal.pone.0289724.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ke Peng; Yan Peng; Wenguang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of GA-XGBoost with XGBoost and LightGBM test results.

  19. BreastSwinFedNetX

    • figshare.com
    zip
    Updated Mar 6, 2025
    Cite
    Rezaul Haque (2025). BreastSwinFedNetX [Dataset]. http://doi.org/10.6084/m9.figshare.28548758.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 6, 2025
    Dataset provided by
    figshare
    Authors
    Rezaul Haque
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets used in this study were collected from the Kaggle platform. Below are their available links:

    1. BreakHis: https://www.kaggle.com/datasets/ambarish/breakhis
    2. Breast Ultrasound Images Dataset (BUSI): https://www.kaggle.com/datasets/sabahesaraki/breast-ultrasound-images-dataset
    3. CBIS-DDSM: https://www.kaggle.com/datasets/seanbaek19/cbis-ddsm-4096
    4. INbreast: https://www.kaggle.com/datasets/eoussama/breast-cancer-mammograms/data
    5. Combined Dataset: https://www.kaggle.com/datasets/rezaullhaque/combined-dataset/data

    The total data size is 26 GB.

  20. ‘Video Games Sales Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Dec 21, 2016
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2016). ‘Video Games Sales Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-video-games-sales-dataset-1d34/779cd618/?iid=015-666&v=presentation
    Explore at:
    Dataset updated
    Dec 21, 2016
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Video Games Sales Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/sidtwr/videogames-sales-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    Motivated by Gregory Smith's web scrape of VGChartz Video Games Sales, this data set simply extends the number of variables with another web scrape from Metacritic. Unfortunately, there are missing observations, as Metacritic only covers a subset of the platforms. Also, a game may not have all the observations for the additional variables discussed below. Complete cases number ~6,900.

    Content

    Alongside the fields: Name, Platform, Year_of_Release, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales, we have:-

    • Critic_score - Aggregate score compiled by Metacritic staff
    • Critic_count - The number of critics used in coming up with the Critic_score
    • User_score - Score by Metacritic's subscribers
    • User_count - Number of users who gave the user_score
    • Developer - Party responsible for creating the game
    • Rating - The ESRB ratings

    Acknowledgements

    This repository, https://github.com/wtamu-cisresearch/scraper, after a few adjustments worked extremely well!

    Inspiration

    It would be interesting to see any machine learning techniques or continued data visualisations applied to this data set.

    --- Original source retains full ownership of the source dataset ---
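    A common first step with this extended table is to isolate the roughly 6,900 complete cases mentioned above, i.e. the rows where all Metacritic-derived fields are present, as in the sketch below. The CSV filename is an assumption, and the actual header may differ in capitalization from the field list above.

```python
# Sketch: keep only the rows where the Metacritic-derived fields are all present.
import pandas as pd

df = pd.read_csv("Video_Games_Sales.csv")  # assumed filename

metacritic_cols = [
    "Critic_score", "Critic_count", "User_score",
    "User_count", "Developer", "Rating",
]
complete = df.dropna(subset=metacritic_cols)
print(len(df), "rows total;", len(complete), "complete cases")
```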
