Description 👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place. This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.
If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the "Join Competition" button to create an account and gain access to the competition data. Then check out Alexis Cook’s Titanic Tutorial, which walks you through, step by step, how to make your first submission!
The Challenge The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable,” sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Recommended Tutorial We highly recommend Alexis Cook’s Titanic Tutorial that walks you through making your very first submission step by step and this starter notebook to get started.
How Kaggle’s Competitions Work
1. Join the Competition: Read about the challenge description, accept the Competition Rules, and gain access to the competition dataset.
2. Get to Work: Download the data, build models on it locally or on Kaggle Notebooks (our no-setup, customizable Jupyter Notebooks environment with free GPUs), and generate a prediction file.
3. Make a Submission: Upload your prediction as a submission on Kaggle and receive an accuracy score.
4. Check the Leaderboard: See how your model ranks against other Kagglers on our leaderboard.
5. Improve Your Score: Check out the discussion forum to find lots of tutorials and insights from other competitors.
Kaggle Lingo Video: You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. Rachael Tatman’s video on Kaggle Lingo to get up to speed!
What Data Will I Use in This Competition? In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.
Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.
The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.
Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.
How to Submit your Prediction to Kaggle Once you’re ready to make a submission and get on the leaderboard:
Click on the “Submit Predictions” button
Upload a CSV file in the submission file format. You’re able to submit 10 submissions a day.
Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.
The file should have exactly 2 columns:
- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)
Got it! I’m ready to get started. Where do I get help if I need it? For Competition Help: Titanic Discussion Forum. Kaggle doesn’t have a dedicated team to help troubleshoot your code, so you’ll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!
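To make the submission format concrete, here is a minimal, hedged sketch of a complete first submission. The feature set and model are illustrative assumptions in the spirit of the tutorial linked above, not an official solution; the file paths assume the standard Kaggle Titanic input directory.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Standard Kaggle paths for this competition's files.
train = pd.read_csv("/kaggle/input/titanic/train.csv")
test = pd.read_csv("/kaggle/input/titanic/test.csv")

# A deliberately small, illustrative feature set.
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, train["Survived"])

# The submission must contain exactly 418 rows and these two columns.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)
```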
A Last Word on Kaggle Notebooks As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository ...
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
Codeparrot's Apps dataset provides an invaluable tool for coders of all levels to learn and fully understand Python. Through a comprehensive collection of programming questions accompanied by detailed solutions, input/output test cases, and related information written in Python, aspiring coders can explore coding with confidence. Comprising natural language questions alongside their respective solutions in Python, this dataset is a perfect starting point for coders looking to unlock the ability to create something from nothing. Take your first steps today with Codeparrot's Apps dataset and discover how much you can achieve in this powerful language as you continue your journey into programming.
For more datasets, click here.
Using this dataset is fairly simple: all of its columns are neatly organized in a single table or CSV file (unless otherwise stated), so reading through them should be straightforward even with minimal coding experience. Find a question that matches your desired difficulty rating or topic, copy it and the accompanying information (including any starter code provided) into your local environment, and run the supplied test cases against the provided input and output values to verify that everything works. Then make your own modifications or additions, respond to the problem instructions as accurately as you can, and check your solution against the tests again. Practicing this way ahead of time is excellent preparation for official competitive-programming situations, and taking a broad approach to learning the basics will save you a great deal of time and confusion in the long run.
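As a rough starting point, the sketch below loads a local copy of the file and inspects one problem. The file name and the assumption that solutions and input_output are JSON-encoded strings follow the column description later in this card and the usual APPS layout, so verify them against your copy.

```python
import json
import pandas as pd

# Assumed local copy of the training split described in this card.
df = pd.read_csv("train.csv")

row = df.iloc[0]
print(row["question"][:500])  # the natural-language problem statement

# In the APPS format, solutions and test cases are stored as JSON-encoded
# strings; this is an assumption to verify against your copy of the file.
solutions = json.loads(row["solutions"])
tests = json.loads(row["input_output"])
print(f"{len(solutions)} reference solutions, "
      f"{len(tests.get('inputs', []))} test cases")
```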
- As a teaching and learning tool to help beginners get comfortable with programming in Python.
- As an interview prep tool for experienced coders, since it contains level-specific code examples and test cases for each question.
- As a competition resource, wherein contestants can try out different solutions and compare them against each other to identify the most efficient one within the given data set
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
| Column name | Description |
|:------------|:------------|
| question | Natural language question related to coding. (String) |
| solutions | Set of solutions written in Python for each given question. (String) |
| **inp...
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Participating in Kaggle's "Santa 2024 - The Perplexity Permutation Puzzle" is an excellent opportunity to enhance your data science skills. Here's a step-by-step guide to help you get started:
1. Create a Kaggle Account: If you haven't already, sign up for a free account on Kaggle.
2. Join the Competition: Navigate to the Santa 2024 competition page, click the "Join Competition" button to register, and review and accept the competition rules.
3. Understand the Problem Statement: Read the Overview and Description sections to grasp the challenge. The task involves helping Rudolph descramble holiday-related words to make the Large Language Models (LLMs) happy.
4. Review the Evaluation Metric: Understand how your submissions will be scored by reading the evaluation criteria provided on the competition page.
5. Download the Dataset: Go to the "Data" tab on the competition page and download the provided datasets, which include training and test data.
6. Set Up Your Development Environment: Install necessary tools such as Python and Jupyter Notebook, or consider using Kaggle Kernels, which provide a cloud-based environment with pre-installed libraries.
7. Explore the Data: Load the dataset and perform exploratory data analysis (EDA) to understand its structure and contents. Identify patterns or anomalies that could inform your modeling approach.
8. Develop a Baseline Model: Start with a simple model to establish a performance baseline. This could involve basic algorithms or heuristics to unscramble words (see the sketch after this list).
9. Feature Engineering: Create new features that might improve model performance, such as letter frequency analysis or positional encoding.
10. Model Selection and Training: Experiment with various machine learning models, such as decision trees, support vector machines, or neural networks, and train them on the training dataset.
11. Model Evaluation: Assess your models using appropriate metrics to ensure they generalize well to unseen data. Use cross-validation techniques to validate your models.
12. Optimize and Tune Hyperparameters: Fine-tune your model's hyperparameters to enhance performance. Consider using grid search or randomized search methods.
13. Prepare and Submit Predictions: Generate predictions on the test dataset, format your submission file as specified in the competition guidelines, and upload it through the "Submit Predictions" button on the competition page.
14. Monitor the Leaderboard: After submission, check the leaderboard to see how your model ranks against others. Use this feedback to iterate and improve your model.
15. Engage with the Community: Participate in the discussion forums to share insights and learn from fellow competitors, and review shared code and solutions to gain new perspectives.
16. Adhere to the Timeline: Be mindful of key dates: Start Date: November 21, 2024; Entry Deadline: January 24, 2025; Final Submission Deadline: January 24, 2025.
For additional guidance, consider reading articles like "My First Real Kaggle Competition — A step by step guide for beginners, from a beginner" and "Dive into the World of Kaggle Competitions: A Step-by-Step Guide".
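As a hedged illustration of steps 7 and 8, the sketch below assumes the competition ships a sample_submission.csv with a text column of scrambled word sequences; both the file name and the column name are assumptions to check against the actual Data tab.

```python
from itertools import islice, permutations

import pandas as pd

# Assumed file name and column layout; confirm on the competition's Data tab.
sub = pd.read_csv("/kaggle/input/santa-2024/sample_submission.csv")
print(sub.head())

def candidate_orderings(text: str, limit: int = 5):
    """Yield a few word permutations of a scrambled sequence.

    A real baseline would score each candidate with the competition's
    language-model-based metric and keep the best one; here we only
    enumerate candidates as a starting point.
    """
    words = text.split()
    for perm in islice(permutations(words), limit):
        yield " ".join(perm)

for candidate in candidate_orderings(sub.loc[0, "text"], limit=3):
    print(candidate)
```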
By following these steps and utilizing available resources, you'll be well on your way to successfully participating in the Santa 2024 competition.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Here is a description of how the datasets for a training notebook used for a Telegram ML Contest solution were prepared.
The first part of the code samples was taken from a private version of this notebook.
Here are the statistics for the programming-language classes from the GitHub Code Snippets database:
[Chart: class distribution of programming languages in the GitHub Code Snippets database]
From this database, two CSV files were created, each with 50,000 code samples covering the 20 included programming languages: one with equal class sizes and one with stratified sampling. The corresponding files are sample_equal_prop_50000.csv and sample_stratified_50000.csv, respectively.
A second option for collecting additional examples was to run this notebook with a larger number of queries (10,000).
The resulting file, dataset-10000.csv, is included in the data card.
The statistics for its programming languages are shown on the next chart; it has 32 labeled classes:
[Chart: class distribution of the 32 labeled classes in dataset-10000.csv]
To make the model more robust, code samples for 20 additional languages were collected, roughly 10 to 15 samples each, covering more or less popular use cases. Also, for the "OTHER" class (ordinary natural-language text, as required by the competition task), text examples from this Hugging Face prompts dataset were added to the file. The resulting file, rare_languages.csv, is also in the data card.
The statistics for the rare-language code snippets are as follows:
[Chart: class distribution of the rare-language code snippets]
At this stage of dataset creation, the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv were cut down to just two, "snippet" and "language"; the equal-numbers version of the file is in the data card as sample_equal_prop_50000_clean.csv.
To prepare the BigQuery dataset file, the index column was dropped and the "content" column was renamed to "snippet". These changes were saved in dataset-10000-clean.csv.
After that, sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined and saved as github-combined-file.csv.
The prepared files took too much RAM to be read with the pandas library, so additional preprocessing was applied: characters such as quotes, commas, ampersands, newlines, and tabs were stripped out. After cleaning, the files were merged with rare_languages.csv and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.
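A schematic sketch of that cleaning-and-merging step is shown below, under the assumptions that the files can be streamed with pandas in chunks and that rare_languages.csv has the same "snippet" and "language" columns; it illustrates the logic rather than reproducing the author's exact pipeline.

```python
import re

import pandas as pd

def strip_symbols(text: str) -> str:
    # Remove quotes, commas, ampersands, newlines, and tabs from a snippet,
    # as described above.
    return re.sub(r'["\',&\n\t]', " ", str(text))

out_path = "github-combined-file-no-symbols-rare-clean.csv"

# Process the large combined file in chunks so it never has to fit in RAM.
first = True
for chunk in pd.read_csv("github-combined-file.csv", chunksize=100_000):
    chunk["snippet"] = chunk["snippet"].map(strip_symbols)
    chunk[["snippet", "language"]].to_csv(
        out_path, mode="w" if first else "a", header=first, index=False)
    first = False

# Append the (much smaller) rare-language samples; same two columns assumed.
rare = pd.read_csv("rare_languages.csv")
rare[["snippet", "language"]].to_csv(out_path, mode="a", header=False, index=False)
```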
The final class distribution turned out as follows:
[Chart: final class distribution of the combined dataset]
To be suitable for the TF-DF format, each programming language was also assigned a dedicated label. The final labels are in the data card.
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle. 📚
datasetUrl 🌐: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl 🖼️: The URL of the dataset owner's profile avatar on Kaggle.
ownerName 👤: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl 🌍: A link to the Kaggle profile page of the dataset owner.
ownerUserId 💼: The unique user ID of the dataset owner on Kaggle.
ownerTier 🎖️: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName 👩💻: The name of the dataset creator, which could be different from the owner.
creatorUrl 🌍: A link to the Kaggle profile page of the dataset creator.
creatorUserId 💼: The unique user ID of the dataset creator.
scriptCount 📜: The number of scripts (kernels) associated with this dataset.
scriptsUrl 🔗: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl 💬: The URL to the discussion forum for this dataset, where users can ask questions and share insights.
viewCount 👀: The number of views the dataset page has received on Kaggle.
downloadCount ⬇️: The number of times the dataset has been downloaded by users.
dateCreated 📅: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated 🔄: The date when the dataset was last updated or modified.
voteButton 👍: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories 🏷️: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName 🛡️: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName 🔑: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize 📦: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes 📂: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl ⬇️: A direct link to download the dataset files.
newKernelNotebookUrl 📝: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl 💻: A link to a new script for running computations or processing data related to the dataset.
usabilityRating 🌟: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath 🔍: A reference to the path in Firestore where this dataset’s metadata is stored.
datasetSlug 🏷️: A URL-friendly version of the dataset name, typically used for URLs.
rank 📈: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource 🌐: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl 🏅: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink 🔗: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId 🏢: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes 🗳️: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names 📑: A comma-separated string of category names that represent the dataset’s classification.
This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way. 🌍📊
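As a quick illustration of working with these columns, the sketch below loads the metadata file and ranks datasets by votes; the CSV file name is an assumption, so substitute the actual file shipped with this dataset.

```python
import pandas as pd

# File name is an assumption; replace it with the CSV included in this dataset.
meta = pd.read_csv("kaggle_datasets_metadata.csv")

# Most-voted datasets, using the columns documented above.
top = (meta[["ownerName", "datasetUrl", "totalVotes", "downloadCount", "usabilityRating"]]
       .sort_values("totalVotes", ascending=False)
       .head(10))
print(top.to_string(index=False))

# Total downloads by license type.
print(meta.groupby("licenseShortName")["downloadCount"]
          .sum()
          .sort_values(ascending=False))
```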
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
User aggregated stats and data using the official Meta Kaggle dataset.
Update notes:
- ... based on the latest version of each Model, Dataset, and Notebook, rather than the creation date of the very first version.
- Added the reaction counts, a new CSV in the Meta Kaggle dataset; the discussion can be found here.
- Added the versions created for Models, Notebooks, and Datasets, to properly track users who are updating their work.
- Known issue: ModelUpvotesGiven and ModelUpvotesReceived values being identical.
Expect some discrepancies from the counts seen in your profile: aside from the lag of one to two days before a new dataset version is published, some information, such as Kaggle staff upvotes and private competitions, is not included. But for almost all members, the figures should reconcile.
📊 (Scheduled) Meta Kaggle Users' Stats
Generated with Bing image generator
This dataset was created by Carlos Moreno-Garcia
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By [source]
This dataset collects job offers gathered through web scraping and filtered according to specific keywords, locations, and times. This data gives users rich and precise search capabilities to uncover the best working solution for them. With the information collected, users can explore options that match their personal situation, skill set, and preferences in terms of location and schedule. The columns provide detailed information on job titles, employer names, locations, and time frames, as well as other necessary parameters, so you can make a smart choice for your next career opportunity.
For more datasets, click here.
This dataset is a great resource for those looking to find an optimal work solution based on keywords, location and time parameters. With this information, users can quickly and easily search through job offers that best fit their needs. Here are some tips on how to use this dataset to its fullest potential:
Start by identifying what type of job offer you want to find. The keyword column will help you narrow down your search by allowing you to search for job postings that contain the word or phrase you are looking for.
Next, consider where the job is located – the Location column tells you where in the world each posting is from so make sure it’s somewhere that suits your needs!
Finally, consider when the position is available: the Time frame column indicates when each posting was made and whether it's a full-time, part-time, or casual/temporary position, so make sure it meets your requirements before applying!
Additionally, if details such as hours per week or further schedule information are important criteria, there is also information provided in the Horari and Temps Oferta columns. Once all three criteria have been ticked off (keywords, location, and time frame), take a look at the Empresa (company name) and Nom_Oferta (offer name) columns to get an idea of who will be employing you should you land the gig!
All these pieces of data put together should give any motivated individual all they need in order to seek out an optimal work solution - keep hunting good luck!
- Machine learning can be used to group job offers in order to facilitate the identification of similarities and differences between them. This could allow users to target their search for a work solution more specifically.
- The data can be used to compare job offerings across different areas or types of jobs, enabling users to make better informed decisions in terms of their career options and goals.
- It may also provide insight into the local job market, enabling companies and employers to identify potential for new opportunities or trends that may have previously gone unnoticed.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: web_scraping_information_offers.csv
| Column name | Description |
|:-------------|:-------------------------------------|
| Nom_Oferta | Name of the job offer. (String) |
| Empresa | Company offering the job. (String) |
| Ubicació | Location of the job offer. (String) |
| Temps_Oferta | Time of the job offer. (String) |
| Horari | Schedule of the job offer. (String) |
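A minimal sketch of filtering this file by keyword and location is shown below, using the column names from the table above; the keyword and city values are made-up examples.

```python
import pandas as pd

offers = pd.read_csv("web_scraping_information_offers.csv")

# Example filter values; replace them with your own keyword and location.
keyword, city = "data", "Barcelona"
mask = (offers["Nom_Oferta"].str.contains(keyword, case=False, na=False)
        & offers["Ubicació"].str.contains(city, case=False, na=False))

print(offers.loc[mask, ["Nom_Oferta", "Empresa", "Horari", "Temps_Oferta"]].head())
```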
If you use this dataset in your research, please credit the original authors.
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
For this project, you are a Business Analyst working for a Music Streaming Service called PlaylistPro.
At PlaylistPro, one of the company's top priorities is to reduce customer churn (stop customers from cancelling their subscriptions).
To reduce churn, PlaylistPro plans to reduce the cost of the music streaming service for customers that are likely to cancel their subscription. But, the company currently does not know which customers are likely or unlikely to cancel their subscription.
Thus, PlaylistPro would like you to build a Supervised Classification Model to predict if a customer will churn based on their subscription information and listening habits. To help you build a model, the company's Operations team created a dataset of 40,000 customers to train your model on. More details are below:
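A minimal sketch of such a supervised churn classifier is below; the file name, the churn target column, and the feature handling are placeholders standing in for whatever the Operations team's 40,000-customer dataset actually contains.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("playlistpro_customers.csv")
y = df["churn"]
X = df.drop(columns=["churn"])

# One-hot encode categorical columns, pass numeric columns through unchanged.
categorical = X.select_dtypes(include="object").columns
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

model = Pipeline([("prep", preprocess),
                  ("clf", GradientBoostingClassifier(random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```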
PlaylistPro is asking you to complete the following 3 tasks:
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Ian Greenleigh [source]
The engineering-as-marketing tools available today allow startups to maximize and take advantage of the engineering talents they possess. By creating useful tools such as calculators, widgets and microsites, businesses can get in front of potential customers and lead them to their products or services.
This dataset provides a comprehensive list of companies using engineering as a marketing strategy and the tools they have created for it. For each company you get the company name, what it does, the tool name, what the tool does, and a URL for further information. An extra notes field provides more detail about each company's marketing habits or other facts relevant to understanding the use cases behind this engineering-driven approach to marketing.
With this data you can take a closer look at how effectively this strategy is working and compare the approaches taken within each industry vertical, in order to maximize conversions among the leads generated by these tools.
For more datasets, click here.
Analyzing this data allows users to gain insights into how successful companies are using engineering-as-marketing techniques to generate leads and expand their customer base. It also provides a valuable resource for other organizations wanting to learn more about how other organizations have achieved success with such practices.
This dataset can be used in many ways such as:
- Analyzing different trends in which engineering-as-marketing techniques are being used across multiple industries
- Examining whether certain techniques lead to higher lead generation or increased customer base
- Comparing effectiveness between companies using different types of tools, etc.
To get started with this dataset, load the CSV into a data analysis tool that supports CSV processing, such as Tableau or RStudio. Then label each column appropriately so it can be understood at a glance by you or other members of your team before any analysis begins. Now you should be all set to analyze this dataset!
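Equivalently, a quick hedged sketch in Python: load the file with pandas and take a first-glance look at it, using the column names documented in the file description further down this card.

```python
import pandas as pd

tools = pd.read_csv("Engineering as Marketing.csv")

# First-glance checks: column labels and how many tools each company has shipped.
print(tools.columns.tolist())
print(tools.groupby("Company name")["Tool name"]
           .count()
           .sort_values(ascending=False)
           .head())
```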
- Leveraging this data to understand the effectiveness of engineering-as-marketing for various companies.
- Creating a sentiment analysis of customers’ responses to engineering-as-marketing tools in order to determine which tools are most popular and successful.
- Analyzing what types of engineering-as-marketing tools have been most successful with specific customer segments, to inform future product development and marketing tactics
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Engineering as Marketing.csv
| Column name | Description |
|:---------------|:--------------------------------------------------------------------|
| Company name | The name of the company. (String) |
| What co does | A brief description of what the company does. (String) |
| Tool name | The name of the engineering-as-marketing tool. (String) |
| What tool does | A brief description of what the tool does. (String) |
| URL | The URL of the engineering-as-marketing tool. (String) |
| Notes | Additional notes about the engineering-as-marketing tool. (String) |
If you use this dataset in your research, please credit the original author, Ian Greenleigh.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was created for the competition "Predict Student Performance from Game Play", which aims to predict student performance during game-based learning in real time from game logs. The dataset's source raw data is available on the developers' site and can be used as supplemental data. The idea for this dataset came from this discussion.
To extract the data, I used my notebook.
The dataset consists of two file types for each non-empty monthly dataset and its ID:
1. Files with train data (_train suffix)
2. Files with labels (_labels suffix)
There are 20 monthly datasets available on the mentioned site.
I tried to replicate the competition's data format as closely as possible, which involved:
I also added save codes, so you can find out whether players started from one of the saves. As far as I know, all players in the competition's dataset started from the beginning, so you may want to ignore players who used save codes.
One interesting aspect of the raw data is that it includes users who quit the game before it ended and may have stopped playing before completing a quiz. I only included users who passed at least the first quiz, which opens up possibilities to supplement data for the first level group, which has the least amount of features.
Implementing all the new logic with this dataset into pipelines may be difficult, and increasing train size may lead to memory errors. Additionally, some sessions are already present in the competition and must be ignored.
I am sharing this dataset with the Kaggle community because I have university exams and do not have enough time to make the implementation myself. However, I believe that supplemental data with proper data cleaning techniques will greatly boost performance. Good luck!
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
I have created a notebook showing how to create recommendations for this competition: https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations
To use RecBole, you first need to create atomic files: https://recbole.io/docs/user_guide/data/atomic_files.html
In this dataset, I have already created the atomic files for you, so you can download them and use them directly with RecBole.
If you want a tutorial on using a sequential model with item features, you can check my notebook: https://www.kaggle.com/astrung/recbole-lstm-sequential-with-item-features. If you want a notebook using only interactions, without user features, you can use this notebook: https://www.kaggle.com/astrung/recbole-lstm-for-recomendation
This dataset contains atomic files of all the interactions and item-feature data for this competition. You can use it with RecBole to build a recommendation engine faster, and you can use my notebooks as a guide for generating recommendation results and submitting them.
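For orientation, here is a hedged sketch of training a sequential model with RecBole on these atomic files; the dataset folder name, data path, and field names are assumptions, so adjust them to match the files you download (RecBole expects the atomic files under data_path/<dataset>/<dataset>.inter and so on).

```python
from recbole.quick_start import run_recbole

# Folder name, path, and field names are assumptions; point `data_path` at the
# directory that contains the folder holding the downloaded atomic files.
config_dict = {
    "data_path": "/kaggle/input/hm-recbole-atomic-files",
    "USER_ID_FIELD": "user_id",
    "ITEM_ID_FIELD": "item_id",
    "TIME_FIELD": "timestamp",
    "load_col": {"inter": ["user_id", "item_id", "timestamp"]},
}

# Train a simple sequential model (GRU4Rec) on the interaction file.
run_recbole(model="GRU4Rec", dataset="hm", config_dict=config_dict)
```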
This dataset was created by Oronzo Comi
It contains the following files:
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By data.world's Admin [source]
This dataset contains vital information about insurance producers in Illinois. It includes detailed and comprehensive data on the last name or business name, first name, mailing address lines 1 and 2, city, state, zip code, and Line of Authority (LOA). This extensive dataset is a great source of information for researchers interested in understanding the insurance-producer industry in Illinois. With up-to-date data points covering all aspects of the insurance producer market in Illinois, it can be used to make informed decisions about the insurance risk coverage options within the state. The dataset is updated regularly, so you can be sure that you are getting an accurate picture of the market landscape today!
For more datasets, click here.
This dataset contains information about insurance producers in Illinois, which includes their last name or business name, first name, mailing address, city, state, zip code and Line of Authority. With this data set you can use it to get a better understanding of the insurance industry in Illinois and learn more about the population of producers.
Guidelines for using this dataset
- Inspect the data: Before starting your analysis take some time to go over all columns included in the dataset and make sure they are understandable and relevant to your objectives.
- Cleaning the Data: Depending on your needs, you may find it necessary to clean up and/or transform some of the data so that you can analyze it more easily. Take caution when cleaning or transforming, as any changes may affect your outcome later during analysis, so make sure that what you do translates accurately into meaningful insights rather than incorrect conclusions caused by mistaken manipulation of the dataset.
- Analyze: Start by looking at descriptive statistics such as aggregate values (mean/median) or frequencies (counts/percentages) for each field or combination of fields from which valid insights can be drawn. You might then tackle deeper analytical questions based on a few hypotheses, such as correlations between two variables. Make sure to verify assumptions against evidence from the provided data at all times. A minimal sketch of this step appears after this list.
- Report: Prepare a summary report including any additional analysis recommendations based upon findings drawn from both the descriptive statistics and the deeper analytic work on potential correlations between variables.
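That sketch of the "Analyze" step might look like the following, using the file name and column names from the file description below:

```python
import pandas as pd

producers = pd.read_csv("doi-insurance-producers-1.csv")

# Descriptive statistics: where producers are concentrated, by mailing city.
print(producers["MAILING_CITY"].value_counts().head(10))

# Distinct producers per city, using the name column as an identifier.
print(producers.groupby("MAILING_CITY")["LAST_NAME_OR_BUSINESS_NAME"]
               .nunique()
               .sort_values(ascending=False)
               .head(10))
```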
- Using the insurance producer dataset and geographic data, create an interactive map of Illinois to visualize where the most insurance producers are located, with detailed county, state, and city data.
- Generate a report providing insights on which insurance producers have concentrated Lines of Authority in a certain area or across multiple states, in order to identify emerging trends in insurance markets or areas in need of additional coverage options.
- Leverage AI algorithms and machine learning techniques to create a predictive model of which Lines of Authority will be more successful for producers operating in certain geographical areas, based on past performance, demographic information, etc.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: doi-insurance-producers-1.csv
| Column name | Description |
|:---------------------------|:----------------------------------------------------------------|
| LAST_NAME_OR_BUSINESS_NAME | Last name or business name of the insurance producer. (String) |
| FIRST_NAME | First name of the insurance producer. (String) |
| MLG_ADDRESS1 | Mailing address line 1 of the insurance producer. (String) |
| MLG_ADDRESS2 | Mailing address line 2 of the insurance producer. (String) |
| MAILING_CITY | City of the insurance produc... |
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
In this notebook, we will walk through solving a complete machine learning problem using a real-world dataset. This was a "homework" assignment given to me for a job application over summer 2018. The entire assignment can be viewed here and the one sentence summary is:
Use the provided building energy data to develop a model that can predict a building's Energy Star score, and then interpret the results to find the variables that are most predictive of the score.
This is a supervised, regression machine learning task: given a set of data with targets (in this case the score) included, we want to train a model that can learn to map the features (also known as the explanatory variables) to the target.
- Supervised problem: we are given both the features and the target.
- Regression problem: the target is a continuous variable, in this case ranging from 0-100.
During training, we want the model to learn the relationship between the features and the score, so we give it both the features and the answer. Then, to test how well the model has learned, we evaluate it on a testing set where it has never seen the answers!
Machine Learning Workflow
Although the exact implementation details can vary, the general structure of a machine learning project stays relatively constant:
1. Data cleaning and formatting
2. Exploratory data analysis
3. Feature engineering and selection
4. Establish a baseline and compare several machine learning models on a performance metric
5. Perform hyperparameter tuning on the best model to optimize it for the problem
6. Evaluate the best model on the testing set
7. Interpret the model results to the extent possible
8. Draw conclusions and write a well-documented report
Setting up the structure of the pipeline ahead of time lets us see how one step flows into the other. However, the machine learning pipeline is an iterative procedure, so we don't always follow these steps in a linear fashion. We may revisit a previous step based on results from further down the pipeline. For example, while we may perform feature selection before building any models, we may use the modeling results to go back and select a different set of features. Or, the modeling may turn up unexpected results that mean we want to explore our data from another angle. Generally, you have to complete one step before moving on to the next, but don't feel like once you have finished one step the first time, you cannot go back and make improvements!
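As a flavor of step 4, the hedged sketch below compares a naive baseline against two models on mean absolute error; the file name and the "score" target column are placeholders standing in for the cleaned building energy data, and all features are assumed to be numeric at this point.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical cleaned data; in practice this comes out of the earlier
# cleaning and feature-engineering steps.
data = pd.read_csv("building_energy_clean.csv")
X, y = data.drop(columns=["score"]), data["score"]

# Establish a naive baseline and compare a few models on one metric (MAE).
models = {
    "baseline (median)": DummyRegressor(strategy="median"),
    "linear regression": LinearRegression(),
    "gradient boosting": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: MAE = {mae:.2f}")
```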
This notebook will cover the first three (and a half) steps of the pipeline, with the other parts discussed in two additional notebooks. Throughout this series, the objective is to show how all the different data science practices come together to form a complete project. I try to focus more on the implementations of the methods rather than explaining them at a low level, but I have provided resources for those who want to go deeper. For the single best book (in my opinion) for learning the basics and implementing machine learning practices in Python, check out Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.
With this outline in place to guide us, let's get started!
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By [source]
This dataset contains a collection of data science job offers from the European EURAXESS database. It includes detailed information on the job title, salary, position type, location, sector, and company name, allowing you to see the kinds of opportunities available if you pursue a career in data science. With this comprehensive set of data points at your disposal, it's easy to explore highly diverse roles and compare different employers to find the right fit for you, all while gaining valuable insight into recent hiring trends in the European Union's labor market. Whether you are thinking about taking your first steps into data science or are already experienced in the field, this dataset provides an up-to-date reference to help align your professional aspirations with actual opportunities.
For more datasets, click here.
This dataset contains job offers for data scientists in the EURAXESS database. It includes relevant information such as company name, job title, salary, location and job description.
The dataset can be used to get an understanding of current trends in data science jobs and salaries in Europe. This can help individuals or companies determine where to focus their resources or look for new data science opportunities.
To start using this dataset, we recommend taking a look at the columns first. There are five main columns - company name, job title, salary (where available), location and job description - which provide detailed information about each individual offer for a data scientist position. By examining these attributes of each position you’ll be able to understand the different requirements for each role across various European countries and begin formulating your search strategy from there.
When considering specific offers within this dataset, it's important to look beyond the physical location at other aspects, such as potential growth opportunities within the organization, the desired level of seniority for developing and applying models on complex datasets, and the fluctuating demands of managing fast-paced projects with tight deadlines. It's therefore advisable to read through all of the details provided when evaluating opportunities tailored to your needs.
If you're looking beyond just salary numbers, keep an open mind when examining the available positions: while money is always important, things like more vacation days or flexible working hours may fit your personal priorities too. Ultimately it's up to you to decide which parameters work best when locating a suitable role via this dataset. Financials aside, making sure that any prospective employer meets your standards for the coding and database frameworks expected of employees also provides great peace of mind toward landing a successful, long-term position, so don't forget that detail while narrowing down your selections!
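As a first pass at such an evaluation, the quick sketch below loads the file and tabulates offers by location and employer; the exact column names are assumptions based on the five columns described above, so check them against jobs.columns first.

```python
import pandas as pd

jobs = pd.read_csv("Data Scientist.csv")
print(jobs.columns.tolist())  # verify the actual column names first

# Assumed column names based on the description above (location, company name).
print(jobs["location"].value_counts().head(10))
print(jobs["company name"].value_counts().head(10))
```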
- Analyzing the language preferences specified in data science job offers in EURAXESS to gain insight into the language requirements of the data science market across different European countries.
- Comparing salary averages between job postings within EURAXESS to identify potential discrepancies between wages paid for similar positions across countries or differences in job requirements at a given pay grade.
- Identifying trends in other special qualifications (e.g., degree, certification) required for data scientist roles within EURAXESS, compared to similar datasets from other regions such as North America, Asia, etc.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Data Scientist.csv
If you use this dataset in your research, please credit the original authors.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Context & Motivation
This dataset provides a comprehensive, self-contained offline installer for the vllm library, a high-throughput engine for LLM inference. It is specifically designed to solve the common "no internet access" problem in Kaggle competitions like the ARC Prize, where packages must be installed from local files. Using this dataset eliminates pip install failures and ensures a consistent, reproducible environment for your submission notebook.
Content The dataset contains a single directory, vllm_wheels, which includes the Python wheel file for vllm==0.9.2 and all of its required dependencies. These files were downloaded and packaged in a standard Kaggle environment to ensure maximum compatibility with the competition's execution environment (Python 3.10, CUDA 12.x).
Usage To use this dataset in your Kaggle notebook (with internet turned OFF):
# --- vLLM Offline Installation ---
# Path to the directory containing the wheel files
WHEELS_PATH = "/kaggle/input/vllm-0-9-2-offline-installer/vllm_wheels"
print("Starting offline installation of vLLM...")
# Install from the local wheels only, without contacting the package index
!pip install --no-index --find-links={WHEELS_PATH} vllm
print("Installation complete.")
# Verify the installation
import vllm
print(f"vLLM version {vllm.__version__} successfully installed.")
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!
For more datasets, click here.
In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.
Once you have obtained new insights about healthcare from the answers provided in this dynamic dataset, it's time for action! Use that newfound understanding of patient needs to develop educational materials and implement any suggested changes. If more criteria are needed for querying this dataset, check whether MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis capabilities, so look out for notifications if that happens.
Finally, once you make an impact with your use case(s), don't forget proper citation etiquette; give credit where credit is due!
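If you prefer working in pandas rather than SQL, a minimal sketch equivalent to the SELECT example above would be:

```python
import pandas as pd

medquad = pd.read_csv("train.csv")   # columns: qtype, Question, Answer

# pandas equivalent of the SQL-style query mentioned above:
# SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%'
hits = medquad.loc[
    (medquad["qtype"] == "Treatment")
    & medquad["Question"].str.contains("pain", case=False, na=False),
    "Answer",
]
print(f"{len(hits)} matching answers")
print(hits.head())
```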
- Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
- Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
- Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
| Column name | Description |
|:------------|:--------------------------------------------------------|
| qtype | The type of medical question. (String) |
| Question | The medical question posed by the patient. (String) |
| Answer | The expert response to the medical question. (String) |
If you use this dataset in your research, please credit the original authors at Huggingface Hub.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
This Grade School Math 8K Linguistically Diverse Training & Test Set is designed to help you develop and improve your understanding of multi-step reasoning question answering. The dataset contains three separate data files: the socratic_test.csv, main_test.csv, and main_train.csv, each containing a set of questions and answers related to grade school math that consists of multiple steps. Each file contains the same columns:
question,answer. The questions contained in this dataset are thoughtfully crafted to lead you through the reasoning journey for arriving at the correct answer each time, allowing you immense opportunities for learning through practice. With over 8 thousand entries for both training and testing purposes in this GSM8K dataset, it takes advanced multi-step reasoning skills to ace these questions! Deepen your knowledge today and master any challenge with ease using this amazing GSM8K set!
For more datasets, click here.
This dataset provides a unique opportunity to study multi-step reasoning for question answering. The GSM8K Linguistically Diverse Training & Test Set consists of 8,000 questions and answers that have been created to simulate real-world scenarios in grade school mathematics. Each question is paired with one answer based on a comprehensive test set. The questions cover topics such as algebra, arithmetic, probability and more.
The dataset consists of two main files: main_train.csv and main_test.csv; the former contains questions and answers specifically related to grade school math, while the latter includes multi-step reasoning tests for each category of the Ontario Math Curriculum (OMC). Each file has the columns question and answer, so every row holds a sequential question/answer pair, making it possible to follow a single path from the start of any given answer or branch out from there according to the logical construction each problem scenario requires. These columns can be used together with text-representation models such as ELMo or BERT to explore different representations for natural language processing tasks such as Q&A, or for building predictive models for numerical-data applications such as classifying resource-efficiency initiatives or forecasting sales volumes on retail platforms.
To use this dataset efficiently, first get familiar with its structure by reading the documentation, so you know the content and format of every field. Then study the examples that best suit your specific purpose, whether that is an education-research experiment, generating insights for marketing-analytics reports, or making predictions in an artificial-intelligence project. Learning the variable definitions before you begin keeps the work focused, prevents false starts caused by premature assumptions, and makes the rest of the research journey much smoother.
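A small hedged sketch of loading the training file and splitting out the final answer is shown below; the "####" separator is an assumption based on common GSM8K formatting, so verify it on a few rows of your copy.

```python
import pandas as pd

train = pd.read_csv("main_train.csv")   # columns: question, answer
print(train.loc[0, "question"])
print(train.loc[0, "answer"])

# GSM8K answers conventionally end with a line of the form "#### <final answer>";
# this separator is an assumption, so check it against the actual file.
def final_answer(answer: str) -> str:
    return answer.split("####")[-1].strip() if "####" in answer else answer

train["final_answer"] = train["answer"].map(final_answer)
print(train[["question", "final_answer"]].head())
```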
- Training language models for improving accuracy in natural language processing applications such as question answering or dialogue systems.
- Generating new grade school math questions and answers using g...
There are lots of datasets online, with more growing every day, to help us all get a handle on this pandemic. Here are just a few links to data we've found that students in ECE 657A, and anyone else who finds their way here, can play with and practice their machine learning skills on. The main dataset is the COVID-19 dataset from Johns Hopkins University. This data is perfect for time series analysis and Recurrent Neural Networks, the final topic in the course. This dataset will be left public so anyone can see it, but to join you must request the link from Prof. Crowley or be in the ECE 657A W20 course at the University of Waterloo.
Your bonus grade for assignment 4 comes from creating a kernel from this dataset, writing up some useful analysis, and publishing that notebook. You can do any kind of analysis you like, but some good places to start are:
- Analysis: feature extraction and analysis of the data to look for patterns that aren't evident from the original features (this is hard for the simple spread/infection/death data since there aren't that many features).
- Other Data: utilize any other datasets in your kernels by loading data about the countries themselves (population, density, wealth, etc.) or their responses to the situation. Tip: if you open a new notebook related to this dataset, you can easily add other data available on Kaggle and link that to your analysis.
- HOW'S MY FLATTENING COVID19 DATASET: this dataset has a lot more files and includes a lot of what I was talking about, so if you produce good kernels there you can also count them toward your asg4 grade. https://www.kaggle.com/howsmyflattening/covid19-challenges
- Predict: make predictions about confirmed cases, deaths, recoveries, or other metrics for the future. You can test your models by training on the past and predicting on the following days, then post a prediction for tomorrow or the next few days given ALL the data up to this point. Hopefully the datasets we've linked here will update automatically so your kernels update as well.
- Create Tasks: you can make your own "Tasks" as part of this Kaggle dataset and propose your own solution; then others can try solving them as well.
- Groups: students can do this assignment either in the same groups they had for assignment 3 or individually.
We're happy to add other relevant data to this Kaggle dataset. In particular, it would be great to integrate live data on the following:
- Progression of each country/region/city in "days since X level", such as days since 100 confirmed cases; see the link for a great example of such a dataset being plotted. I haven't seen a live link to a CSV of that data, but we could generate one.
- Mitigation policies enacted by local governments in each city/region/country: the dates when a region first enacted Level 1, 2, 3, or 4 containment, started encouraging social distancing, or closed different levels of schools, pubs, restaurants, etc.
- The hidden positives: a dataset, or a method for estimating, as described by Emtiyaz Khan in this Twitter thread. The idea is: how many unreported or unconfirmed cases are there in any region, and can we build an estimate of that number using other regions with widespread testing as a baseline, together with death rates, which are like an observation of a process with a hidden variable, the true infection rate.
- A paper discussing one way to compute this: https://cmmid.github.io/topics/covid19/severity/global_cfr_estimates.html
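As a tiny hedged example of the "Predict" idea (train on the past, predict the next few days), the sketch below fits a log-linear trend to a single region's cumulative confirmed cases; the CSV name and its date/confirmed columns are placeholders for a tidy extract of the Johns Hopkins data, which needs reshaping before it looks like this.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Placeholder file: one region, one row per date, columns "date" and "confirmed".
ts = pd.read_csv("covid_confirmed_canada.csv", parse_dates=["date"]).sort_values("date")

days = (ts["date"] - ts["date"].min()).dt.days.to_numpy().reshape(-1, 1)
y = np.log1p(ts["confirmed"].to_numpy())  # log scale for roughly exponential growth

# Train on everything except the last week, then "forecast" that held-out week.
model = LinearRegression().fit(days[:-7], y[:-7])
pred = np.expm1(model.predict(days[-7:]))

print(pd.DataFrame({
    "date": ts["date"].iloc[-7:].to_numpy(),
    "actual": ts["confirmed"].iloc[-7:].to_numpy(),
    "predicted": pred.round(),
}))
```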