Description 👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place. This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.
If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the "Join Competition" button to create an account and gain access to the competition data. Then check out Alexis Cook’s Titanic Tutorial, which walks you through, step by step, how to make your first submission!
The Challenge The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable,” sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Recommended Tutorial We highly recommend Alexis Cook’s Titanic Tutorial that walks you through making your very first submission step by step and this starter notebook to get started.
How Kaggle’s Competitions Work
1. Join the Competition: Read about the challenge description, accept the Competition Rules, and gain access to the competition dataset.
2. Get to Work: Download the data, build models on it locally or on Kaggle Notebooks (our no-setup, customizable Jupyter Notebooks environment with free GPUs), and generate a prediction file.
3. Make a Submission: Upload your prediction as a submission on Kaggle and receive an accuracy score.
4. Check the Leaderboard: See how your model ranks against other Kagglers on our leaderboard.
5. Improve Your Score: Check out the discussion forum to find lots of tutorials and insights from other competitors.
Kaggle Lingo Video: You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. Rachael Tatman’s video on Kaggle Lingo to get up to speed!
What Data Will I Use in This Competition? In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.
Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.
The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.
Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.
How to Submit your Prediction to Kaggle Once you’re ready to make a submission and get on the leaderboard:
Click on the “Submit Predictions” button
Upload a CSV file in the submission file format. You’re able to submit 10 submissions a day.
Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.
The file should have exactly 2 columns:
- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)
Got it! I’m ready to get started. Where do I get help if I need it? For Competition Help: Titanic Discussion Forum. Kaggle doesn’t have a dedicated team to help troubleshoot your code, so you’ll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!
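To make the submission format concrete, here is a minimal, hedged sketch of a complete first submission. The feature set and model are illustrative assumptions in the spirit of the tutorial linked above, not an official solution; the file paths assume the standard Kaggle Titanic input directory.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Standard Kaggle paths for this competition's files.
train = pd.read_csv("/kaggle/input/titanic/train.csv")
test = pd.read_csv("/kaggle/input/titanic/test.csv")

# A deliberately small, illustrative feature set.
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, train["Survived"])

# The submission must contain exactly 418 rows and these two columns.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)
```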
A Last Word on Kaggle Notebooks As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository ...
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
Codeparrot's Apps dataset provides an invaluable tool for coders of all levels to learn and fully understand Python. Through a comprehensive collection of programming questions accompanied by detailed solutions, input/output test cases, and related information written in Python, aspiring coders can explore coding with confidence. Comprising natural language questions alongside their respective solutions in Python, this dataset is a perfect starting point for coders looking to unlock the ability to create something from nothing. Take your first steps today with Codeparrot's Apps dataset and discover how much you can achieve in this powerful language as you continue your journey into programming.
For more datasets, click here.
Using this dataset is fairly simple: all of its columns are neatly organized in a single table or CSV file (unless otherwise stated), so reading through them should be straightforward even with minimal coding experience. Find a question that matches your desired difficulty rating or topic, copy it and the accompanying information (including any starter code provided) into your local environment, and run the supplied test cases against the provided input and output values to verify that everything works. Then make your own modifications or additions, respond to the problem instructions as accurately as you can, and check your solution against the tests again. Practicing this way ahead of time is excellent preparation for official competitive-programming situations, and taking a broad approach to learning the basics will save you a great deal of time and confusion in the long run.
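As a rough starting point, the sketch below loads a local copy of the file and inspects one problem. The file name and the assumption that solutions and input_output are JSON-encoded strings follow the column description later in this card and the usual APPS layout, so verify them against your copy.

```python
import json
import pandas as pd

# Assumed local copy of the training split described in this card.
df = pd.read_csv("train.csv")

row = df.iloc[0]
print(row["question"][:500])  # the natural-language problem statement

# In the APPS format, solutions and test cases are stored as JSON-encoded
# strings; this is an assumption to verify against your copy of the file.
solutions = json.loads(row["solutions"])
tests = json.loads(row["input_output"])
print(f"{len(solutions)} reference solutions, "
      f"{len(tests.get('inputs', []))} test cases")
```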
- As a teaching and learning tool to help beginners get comfortable with programming in Python.
- As an interview prep tool for experienced coders, since it contains level-specific code examples and test cases for each question.
- As a competition resource, wherein contestants can try out different solutions and compare them against each other to identify the most efficient one within the given data set
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
| Column name | Description |
|:------------|:------------|
| question | Natural language question related to coding. (String) |
| solutions | Set of solutions written in Python for each given question. (String) |
| **inp...
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Participating in Kaggle's "Santa 2024 - The Perplexity Permutation Puzzle" is an excellent opportunity to enhance your data science skills. Here's a step-by-step guide to help you get started:
1. Create a Kaggle Account: If you haven't already, sign up for a free account on Kaggle.
2. Join the Competition: Navigate to the Santa 2024 competition page, click the "Join Competition" button to register, and review and accept the competition rules.
3. Understand the Problem Statement: Read the Overview and Description sections to grasp the challenge. The task involves helping Rudolph descramble holiday-related words to make the Large Language Models (LLMs) happy.
4. Review the Evaluation Metric: Understand how your submissions will be scored by reading the evaluation criteria provided on the competition page.
5. Download the Dataset: Go to the "Data" tab on the competition page and download the provided datasets, which include training and test data.
6. Set Up Your Development Environment: Install necessary tools such as Python and Jupyter Notebook, or consider using Kaggle Kernels, which provide a cloud-based environment with pre-installed libraries.
7. Explore the Data: Load the dataset and perform exploratory data analysis (EDA) to understand its structure and contents. Identify patterns or anomalies that could inform your modeling approach.
8. Develop a Baseline Model: Start with a simple model to establish a performance baseline. This could involve basic algorithms or heuristics to unscramble words (see the sketch after this list).
9. Feature Engineering: Create new features that might improve model performance, such as letter frequency analysis or positional encoding.
10. Model Selection and Training: Experiment with various machine learning models, such as decision trees, support vector machines, or neural networks, and train them on the training dataset.
11. Model Evaluation: Assess your models using appropriate metrics to ensure they generalize well to unseen data. Use cross-validation techniques to validate your models.
12. Optimize and Tune Hyperparameters: Fine-tune your model's hyperparameters to enhance performance. Consider using grid search or randomized search methods.
13. Prepare and Submit Predictions: Generate predictions on the test dataset, format your submission file as specified in the competition guidelines, and upload it through the "Submit Predictions" button on the competition page.
14. Monitor the Leaderboard: After submission, check the leaderboard to see how your model ranks against others. Use this feedback to iterate and improve your model.
15. Engage with the Community: Participate in the discussion forums to share insights and learn from fellow competitors, and review shared code and solutions to gain new perspectives.
16. Adhere to the Timeline: Be mindful of key dates: Start Date: November 21, 2024; Entry Deadline: January 24, 2025; Final Submission Deadline: January 24, 2025.
For additional guidance, consider reading articles like "My First Real Kaggle Competition — A step by step guide for beginners, from a beginner" and "Dive into the World of Kaggle Competitions: A Step-by-Step Guide".
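As a hedged illustration of steps 7 and 8, the sketch below assumes the competition ships a sample_submission.csv with a text column of scrambled word sequences; both the file name and the column name are assumptions to check against the actual Data tab.

```python
from itertools import islice, permutations

import pandas as pd

# Assumed file name and column layout; confirm on the competition's Data tab.
sub = pd.read_csv("/kaggle/input/santa-2024/sample_submission.csv")
print(sub.head())

def candidate_orderings(text: str, limit: int = 5):
    """Yield a few word permutations of a scrambled sequence.

    A real baseline would score each candidate with the competition's
    language-model-based metric and keep the best one; here we only
    enumerate candidates as a starting point.
    """
    words = text.split()
    for perm in islice(permutations(words), limit):
        yield " ".join(perm)

for candidate in candidate_orderings(sub.loc[0, "text"], limit=3):
    print(candidate)
```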
By following these steps and utilizing available resources, you'll be well on your way to successfully participating in the Santa 2024 competition.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Here is a description of how the datasets for a training notebook used for a Telegram ML Contest solution were prepared.
The first part of the code samples was taken from a private version of this notebook.
Here are the statistics for the programming-language classes from the GitHub Code Snippets database:
[Chart: class distribution of programming languages in the GitHub Code Snippets database]
From this database, two CSV files were created, each with 50,000 code samples covering the 20 included programming languages: one with equal class sizes and one with stratified sampling. The corresponding files are sample_equal_prop_50000.csv and sample_stratified_50000.csv, respectively.
A second option for collecting additional examples was to run this notebook with a larger number of queries (10,000).
The resulting file, dataset-10000.csv, is included in the data card.
The statistics for its programming languages are shown on the next chart; it has 32 labeled classes:
[Chart: class distribution of the 32 labeled classes in dataset-10000.csv]
To make the model more robust, code samples for 20 additional languages were collected, roughly 10 to 15 samples each, covering more or less popular use cases. Also, for the "OTHER" class (ordinary natural-language text, as required by the competition task), text examples from this Hugging Face prompts dataset were added to the file. The resulting file, rare_languages.csv, is also in the data card.
The statistics for the rare-language code snippets are as follows:
[Chart: class distribution of the rare-language code snippets]
At this stage of dataset creation, the columns in sample_equal_prop_50000.csv and sample_stratified_50000.csv were cut down to just two, "snippet" and "language"; the equal-numbers version of the file is in the data card as sample_equal_prop_50000_clean.csv.
To prepare the BigQuery dataset file, the index column was dropped and the "content" column was renamed to "snippet". These changes were saved in dataset-10000-clean.csv.
After that, sample_equal_prop_50000_clean.csv and dataset-10000-clean.csv were combined and saved as github-combined-file.csv.
The prepared files took too much RAM to be read with the pandas library, so additional preprocessing was applied: characters such as quotes, commas, ampersands, newlines, and tabs were stripped out. After cleaning, the files were merged with rare_languages.csv and saved as github-combined-file-no-symbols-rare-clean.csv and sample_equal_prop_50000_-no-symbols-rare-clean.csv, respectively.
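A schematic sketch of that cleaning-and-merging step is shown below, under the assumptions that the files can be streamed with pandas in chunks and that rare_languages.csv has the same "snippet" and "language" columns; it illustrates the logic rather than reproducing the author's exact pipeline.

```python
import re

import pandas as pd

def strip_symbols(text: str) -> str:
    # Remove quotes, commas, ampersands, newlines, and tabs from a snippet,
    # as described above.
    return re.sub(r'["\',&\n\t]', " ", str(text))

out_path = "github-combined-file-no-symbols-rare-clean.csv"

# Process the large combined file in chunks so it never has to fit in RAM.
first = True
for chunk in pd.read_csv("github-combined-file.csv", chunksize=100_000):
    chunk["snippet"] = chunk["snippet"].map(strip_symbols)
    chunk[["snippet", "language"]].to_csv(
        out_path, mode="w" if first else "a", header=first, index=False)
    first = False

# Append the (much smaller) rare-language samples; same two columns assumed.
rare = pd.read_csv("rare_languages.csv")
rare[["snippet", "language"]].to_csv(out_path, mode="a", header=False, index=False)
```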
The final class distribution turned out as follows:
[Chart: final class distribution of the combined dataset]
To be suitable for the TF-DF format, each programming language was also assigned a dedicated label. The final labels are in the data card.
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle. 📚
datasetUrl 🌐: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl 🖼️: The URL of the dataset owner's profile avatar on Kaggle.
ownerName 👤: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl 🌍: A link to the Kaggle profile page of the dataset owner.
ownerUserId 💼: The unique user ID of the dataset owner on Kaggle.
ownerTier 🎖️: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName 👩💻: The name of the dataset creator, which could be different from the owner.
creatorUrl 🌍: A link to the Kaggle profile page of the dataset creator.
creatorUserId 💼: The unique user ID of the dataset creator.
scriptCount 📜: The number of scripts (kernels) associated with this dataset.
scriptsUrl 🔗: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl 💬: The URL to the discussion forum for this dataset, where users can ask questions and share insights.
viewCount 👀: The number of views the dataset page has received on Kaggle.
downloadCount ⬇️: The number of times the dataset has been downloaded by users.
dateCreated 📅: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated 🔄: The date when the dataset was last updated or modified.
voteButton 👍: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories 🏷️: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName 🛡️: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName 🔑: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize 📦: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes 📂: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl ⬇️: A direct link to download the dataset files.
newKernelNotebookUrl 📝: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl 💻: A link to a new script for running computations or processing data related to the dataset.
usabilityRating 🌟: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath 🔍: A reference to the path in Firestore where this dataset’s metadata is stored.
datasetSlug 🏷️: A URL-friendly version of the dataset name, typically used for URLs.
rank 📈: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource 🌐: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl 🏅: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink 🔗: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId 🏢: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes 🗳️: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names 📑: A comma-separated string of category names that represent the dataset’s classification.
This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way. 🌍📊
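As a quick illustration of working with these columns, the sketch below loads the metadata file and ranks datasets by votes; the CSV file name is an assumption, so substitute the actual file shipped with this dataset.

```python
import pandas as pd

# File name is an assumption; replace it with the CSV included in this dataset.
meta = pd.read_csv("kaggle_datasets_metadata.csv")

# Most-voted datasets, using the columns documented above.
top = (meta[["ownerName", "datasetUrl", "totalVotes", "downloadCount", "usabilityRating"]]
       .sort_values("totalVotes", ascending=False)
       .head(10))
print(top.to_string(index=False))

# Total downloads by license type.
print(meta.groupby("licenseShortName")["downloadCount"]
          .sum()
          .sort_values(ascending=False))
```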
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
User aggregated stats and data using the official Meta Kaggle dataset.
Update notes:
- ... based on the latest version of each Model, Dataset, and Notebook, rather than the creation date of the very first version.
- Added the reaction counts, a new CSV in the Meta Kaggle dataset; the discussion can be found here.
- Added the versions created for Models, Notebooks, and Datasets, to properly track users who are updating their work.
- Known issue: ModelUpvotesGiven and ModelUpvotesReceived values being identical.
Expect some discrepancies from the counts seen in your profile: aside from the lag of one to two days before a new dataset version is published, some information, such as Kaggle staff upvotes and private competitions, is not included. But for almost all members, the figures should reconcile.
📊 (Scheduled) Meta Kaggle Users' Stats
Generated with Bing image generator
This dataset was created by Carlos Moreno-Garcia
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By [source]
This dataset collects job offers gathered through web scraping and filtered according to specific keywords, locations, and times. This data gives users rich and precise search capabilities to uncover the best working solution for them. With the information collected, users can explore options that match their personal situation, skill set, and preferences in terms of location and schedule. The columns provide detailed information on job titles, employer names, locations, and time frames, as well as other necessary parameters, so you can make a smart choice for your next career opportunity.
For more datasets, click here.
This dataset is a great resource for those looking to find an optimal work solution based on keywords, location and time parameters. With this information, users can quickly and easily search through job offers that best fit their needs. Here are some tips on how to use this dataset to its fullest potential:
Start by identifying what type of job offer you want to find. The keyword column will help you narrow down your search by allowing you to search for job postings that contain the word or phrase you are looking for.
Next, consider where the job is located – the Location column tells you where in the world each posting is from so make sure it’s somewhere that suits your needs!
Finally, consider when the position is available: the Time frame column indicates when each posting was made and whether it's a full-time, part-time, or casual/temporary position, so make sure it meets your requirements before applying!
Additionally, if details such as hours per week or further schedule information are important criteria, there is also information provided in the Horari and Temps Oferta columns. Once all three criteria have been ticked off (keywords, location, and time frame), take a look at the Empresa (company name) and Nom_Oferta (offer name) columns to get an idea of who will be employing you should you land the gig!
All these pieces of data put together should give any motivated individual all they need in order to seek out an optimal work solution - keep hunting good luck!
- Machine learning can be used to group job offers in order to facilitate the identification of similarities and differences between them. This could allow users to target their search for a work solution more specifically.
- The data can be used to compare job offerings across different areas or types of jobs, enabling users to make better informed decisions in terms of their career options and goals.
- It may also provide insight into the local job market, enabling companies and employers to identify potential for new opportunities or trends that may have previously gone unnoticed.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: web_scraping_information_offers.csv
| Column name | Description |
|:-------------|:-------------------------------------|
| Nom_Oferta | Name of the job offer. (String) |
| Empresa | Company offering the job. (String) |
| Ubicació | Location of the job offer. (String) |
| Temps_Oferta | Time of the job offer. (String) |
| Horari | Schedule of the job offer. (String) |
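A minimal sketch of filtering this file by keyword and location is shown below, using the column names from the table above; the keyword and city values are made-up examples.

```python
import pandas as pd

offers = pd.read_csv("web_scraping_information_offers.csv")

# Example filter values; replace them with your own keyword and location.
keyword, city = "data", "Barcelona"
mask = (offers["Nom_Oferta"].str.contains(keyword, case=False, na=False)
        & offers["Ubicació"].str.contains(city, case=False, na=False))

print(offers.loc[mask, ["Nom_Oferta", "Empresa", "Horari", "Temps_Oferta"]].head())
```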
If you use this dataset in your research, please credit the original authors.
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
For this project, you are a Business Analyst working for a Music Streaming Service called PlaylistPro.
At PlaylistPro, one of the company's top priorities is to reduce customer churn (stop customers from cancelling their subscriptions).
To reduce churn, PlaylistPro plans to reduce the cost of the music streaming service for customers that are likely to cancel their subscription. But, the company currently does not know which customers are likely or unlikely to cancel their subscription.
Thus, PlaylistPro would like you to build a Supervised Classification Model to predict if a customer will churn based on their subscription information and listening habits. To help you build a model, the company's Operations team created a dataset of 40,000 customers to train your model on. More details are below:
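A minimal sketch of such a supervised churn classifier is below; the file name, the churn target column, and the feature handling are placeholders standing in for whatever the Operations team's 40,000-customer dataset actually contains.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("playlistpro_customers.csv")
y = df["churn"]
X = df.drop(columns=["churn"])

# One-hot encode categorical columns, pass numeric columns through unchanged.
categorical = X.select_dtypes(include="object").columns
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

model = Pipeline([("prep", preprocess),
                  ("clf", GradientBoostingClassifier(random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```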
PlaylistPro is asking you to complete the following 3 tasks:
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Ian Greenleigh [source]
The engineering-as-marketing tools available today allow startups to maximize and take advantage of the engineering talents they possess. By creating useful tools such as calculators, widgets and microsites, businesses can get in front of potential customers and lead them to their products or services.
This dataset provides a comprehensive list of companies using engineering as a marketing strategy and the tools they have created for it. For each company you get the company name, what it does, the tool name, what the tool does, and a URL for further information. An extra notes field provides more detail about each company's marketing habits or other facts relevant to understanding the use cases behind this engineering-driven approach to marketing.
With this data you can take a closer look at how effectively this strategy is working and compare the approaches taken within each industry vertical, in order to maximize conversions among the leads generated by these tools.
For more datasets, click here.
Analyzing this data allows users to gain insights into how successful companies are using engineering-as-marketing techniques to generate leads and expand their customer base. It also provides a valuable resource for other organizations wanting to learn more about how other organizations have achieved success with such practices.
This dataset can be used in many ways such as:
- Analyzing different trends in which engineering-as-marketing techniques are being used across multiple industries
- Examining whether certain techniques lead to higher lead generation or increased customer base
- Comparing effectiveness between companies using different types of tools, etc.
To get started with this dataset, load the CSV into a data analysis tool that supports CSV processing, such as Tableau or RStudio. Then label each column appropriately so it can be understood at a glance by you or other members of your team before any analysis begins. Now you should be all set to analyze this dataset!
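Equivalently, a quick hedged sketch in Python: load the file with pandas and take a first-glance look at it, using the column names documented in the file description further down this card.

```python
import pandas as pd

tools = pd.read_csv("Engineering as Marketing.csv")

# First-glance checks: column labels and how many tools each company has shipped.
print(tools.columns.tolist())
print(tools.groupby("Company name")["Tool name"]
           .count()
           .sort_values(ascending=False)
           .head())
```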
- Leveraging this data to understand the effectiveness of engineering-as-marketing for various companies.
- Creating a sentiment analysis of customers’ responses to engineering-as-marketing tools in order to determine which tools are most popular and successful.
- Analyzing what types of engineering-as-marketing tools have been most successful with specific customer segments, to inform future product development and marketing tactics
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Engineering as Marketing.csv
| Column name | Description |
|:---------------|:--------------------------------------------------------------------|
| Company name | The name of the company. (String) |
| What co does | A brief description of what the company does. (String) |
| Tool name | The name of the engineering-as-marketing tool. (String) |
| What tool does | A brief description of what the tool does. (String) |
| URL | The URL of the engineering-as-marketing tool. (String) |
| Notes | Additional notes about the engineering-as-marketing tool. (String) |
If you use this dataset in your research, please credit the original author, Ian Greenleigh.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset was created for the competition "Predict Student Performance from Game Play", which aims to predict student performance during game-based learning in real time from game logs. The dataset's source raw data is available on the developers' site and can be used as supplemental data. The idea for this dataset came from this discussion.
To extract the data, I used my notebook.
The dataset consists of two file types for each non-empty monthly dataset and its ID:
1. Files with train data (_train suffix)
2. Files with labels (_labels suffix)
There are 20 monthly datasets available on the mentioned site.
I tried to replicate the competition's data format as closely as possible, which involved:
I also added save codes, so you can find out whether players started from one of the saves. As far as I know, all players in the competition's dataset started from the beginning, so you may want to ignore players who used save codes.
One interesting aspect of the raw data is that it includes users who quit the game before it ended and may have stopped playing before completing a quiz. I only included users who passed at least the first quiz, which opens up possibilities to supplement data for the first level group, which has the least amount of features.
Implementing all the new logic with this dataset into pipelines may be difficult, and increasing train size may lead to memory errors. Additionally, some sessions are already present in the competition and must be ignored.
I am sharing this dataset with the Kaggle community because I have university exams and do not have enough time to make the implementation myself. However, I believe that supplemental data with proper data cleaning techniques will greatly boost performance. Good luck!
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
I have created a notebook showing how to create recommendations for this competition: https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations
To use RecBole, you first need to create atomic files: https://recbole.io/docs/user_guide/data/atomic_files.html
In this dataset, I have already created the atomic files for you, so you can download them and use them directly with RecBole.
If you want a tutorial on using a sequential model with item features, you can check my notebook: https://www.kaggle.com/astrung/recbole-lstm-sequential-with-item-features. If you want a notebook using only interactions, without user features, you can use this notebook: https://www.kaggle.com/astrung/recbole-lstm-for-recomendation
This dataset contains atomic files of all the interactions and item-feature data for this competition. You can use it with RecBole to build a recommendation engine faster, and you can use my notebooks as a guide for generating recommendation results and submitting them.
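For orientation, here is a hedged sketch of training a sequential model with RecBole on these atomic files; the dataset folder name, data path, and field names are assumptions, so adjust them to match the files you download (RecBole expects the atomic files under data_path/<dataset>/<dataset>.inter and so on).

```python
from recbole.quick_start import run_recbole

# Folder name, path, and field names are assumptions; point `data_path` at the
# directory that contains the folder holding the downloaded atomic files.
config_dict = {
    "data_path": "/kaggle/input/hm-recbole-atomic-files",
    "USER_ID_FIELD": "user_id",
    "ITEM_ID_FIELD": "item_id",
    "TIME_FIELD": "timestamp",
    "load_col": {"inter": ["user_id", "item_id", "timestamp"]},
}

# Train a simple sequential model (GRU4Rec) on the interaction file.
run_recbole(model="GRU4Rec", dataset="hm", config_dict=config_dict)
```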
This dataset was created by Oronzo Comi
It contains the following files:
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
By data.world's Admin [source]
This dataset contains vital information about insurance producers in Illinois. It includes detailed and comprehensive data on the last name or business name, first name, mailing address lines 1 and 2, city, state, zip code, and Line of Authority (LOA). This extensive dataset is a great source of information for researchers interested in understanding the insurance-producer industry in Illinois. With up-to-date data points covering all aspects of the insurance producer market in Illinois, it can be used to make informed decisions about the insurance risk coverage options within the state. The dataset is updated regularly, so you can be sure that you are getting an accurate picture of the market landscape today!
For more datasets, click here.
This dataset contains information about insurance producers in Illinois, which includes their last name or business name, first name, mailing address, city, state, zip code and Line of Authority. With this data set you can use it to get a better understanding of the insurance industry in Illinois and learn more about the population of producers.
Guidelines for using this dataset
- Inspect the data: Before starting your analysis take some time to go over all columns included in the dataset and make sure they are understandable and relevant to your objectives.
- Cleaning the Data: Depending on your needs, you may find it necessary to clean up and/or transform some of the data so that you can analyze it more easily. Take caution when cleaning or transforming, as any changes may affect your outcome later during analysis, so make sure that what you do translates accurately into meaningful insights rather than incorrect conclusions caused by mistaken manipulation of the dataset.
- Analyze: Start by looking at descriptive statistics such as aggregate values (mean/median) or frequencies (counts/percentages) for each field or combination of fields from which valid insights can be drawn. You might then tackle deeper analytical questions based on a few hypotheses, such as correlations between two variables. Make sure to verify assumptions against evidence from the provided data at all times. A minimal sketch of this step appears after this list.
- Report: Prepare a summary report including any additional analysis recommendations based upon findings drawn from both the descriptive statistics and the deeper analytic work on potential correlations between variables.
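That sketch of the "Analyze" step might look like the following, using the file name and column names from the file description below:

```python
import pandas as pd

producers = pd.read_csv("doi-insurance-producers-1.csv")

# Descriptive statistics: where producers are concentrated, by mailing city.
print(producers["MAILING_CITY"].value_counts().head(10))

# Distinct producers per city, using the name column as an identifier.
print(producers.groupby("MAILING_CITY")["LAST_NAME_OR_BUSINESS_NAME"]
               .nunique()
               .sort_values(ascending=False)
               .head(10))
```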
- Using the insurance producer dataset and geographic data, create an interactive map of Illinois to visualize where the most insurance producers are located, with detailed county, state, and city data.
- Generate a report providing insights on which insurance producers have concentrated Lines of Authority in a certain area or across multiple states, in order to identify emerging trends in insurance markets or areas in need of additional coverage options.
- Leverage AI algorithms and machine learning techniques to create a predictive model of which Lines of Authority will be more successful for producers operating in certain geographical areas, based on past performance, demographic information, etc.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: doi-insurance-producers-1.csv
| Column name | Description |
|:---------------------------|:----------------------------------------------------------------|
| LAST_NAME_OR_BUSINESS_NAME | Last name or business name of the insurance producer. (String) |
| FIRST_NAME | First name of the insurance producer. (String) |
| MLG_ADDRESS1 | Mailing address line 1 of the insurance producer. (String) |
| MLG_ADDRESS2 | Mailing address line 2 of the insurance producer. (String) |
| MAILING_CITY | City of the insurance produc... |
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
In this notebook, we will walk through solving a complete machine learning problem using a real-world dataset. This was a "homework" assignment given to me for a job application over summer 2018. The entire assignment can be viewed here and the one sentence summary is:
Use the provided building energy data to develop a model that can predict a building's Energy Star score, and then interpret the results to find the variables that are most predictive of the score.
This is a supervised, regression machine learning task: given a set of data with targets (in this case the score) included, we want to train a model that can learn to map the features (also known as the explanatory variables) to the target.
- Supervised problem: we are given both the features and the target.
- Regression problem: the target is a continuous variable, in this case ranging from 0-100.
During training, we want the model to learn the relationship between the features and the score, so we give it both the features and the answer. Then, to test how well the model has learned, we evaluate it on a testing set where it has never seen the answers!
Machine Learning Workflow
Although the exact implementation details can vary, the general structure of a machine learning project stays relatively constant:
1. Data cleaning and formatting
2. Exploratory data analysis
3. Feature engineering and selection
4. Establish a baseline and compare several machine learning models on a performance metric
5. Perform hyperparameter tuning on the best model to optimize it for the problem
6. Evaluate the best model on the testing set
7. Interpret the model results to the extent possible
8. Draw conclusions and write a well-documented report
Setting up the structure of the pipeline ahead of time lets us see how one step flows into the other. However, the machine learning pipeline is an iterative procedure, so we don't always follow these steps in a linear fashion. We may revisit a previous step based on results from further down the pipeline. For example, while we may perform feature selection before building any models, we may use the modeling results to go back and select a different set of features. Or, the modeling may turn up unexpected results that mean we want to explore our data from another angle. Generally, you have to complete one step before moving on to the next, but don't feel like once you have finished one step the first time, you cannot go back and make improvements!
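As a flavor of step 4, the hedged sketch below compares a naive baseline against two models on mean absolute error; the file name and the "score" target column are placeholders standing in for the cleaned building energy data, and all features are assumed to be numeric at this point.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical cleaned data; in practice this comes out of the earlier
# cleaning and feature-engineering steps.
data = pd.read_csv("building_energy_clean.csv")
X, y = data.drop(columns=["score"]), data["score"]

# Establish a naive baseline and compare a few models on one metric (MAE).
models = {
    "baseline (median)": DummyRegressor(strategy="median"),
    "linear regression": LinearRegression(),
    "gradient boosting": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: MAE = {mae:.2f}")
```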
This notebook will cover the first three (and a half) steps of the pipeline, with the other parts discussed in two additional notebooks. Throughout this series, the objective is to show how all the different data science practices come together to form a complete project. I try to focus more on the implementations of the methods rather than explaining them at a low level, but I have provided resources for those who want to go deeper. For the single best book (in my opinion) for learning the basics and implementing machine learning practices in Python, check out Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.
With this outline in place to guide us, let's get started!
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By [source]
This dataset contains a collection of data science job offers from the European EURAXESS database. It includes detailed information on the job title, salary, position type, location, sector, and company name, allowing you to see the kinds of opportunities available if you pursue a career in data science. With this comprehensive set of data points at your disposal, it's easy to explore highly diverse roles and compare different employers to find the right fit for you, all while gaining valuable insight into recent hiring trends in the European Union's labor market. Whether you are thinking about taking your first steps into data science or are already experienced in the field, this dataset provides an up-to-date reference to help align your professional aspirations with actual opportunities.
For more datasets, click here.
This dataset contains job offers for data scientists in the EURAXESS database. It includes relevant information such as company name, job title, salary, location and job description.
The dataset can be used to get an understanding of current trends in data science jobs and salaries in Europe. This can help individuals or companies determine where to focus their resources or look for new data science opportunities.
To start using this dataset, we recommend taking a look at the columns first. There are five main columns - company name, job title, salary (where available), location and job description - which provide detailed information about each individual offer for a data scientist position. By examining these attributes of each position you’ll be able to understand the different requirements for each role across various European countries and begin formulating your search strategy from there.
When considering specific offers within this dataset, it's important to look beyond the physical location at other aspects, such as potential growth opportunities within the organization, the desired level of seniority for developing and applying models on complex datasets, and the fluctuating demands of managing fast-paced projects with tight deadlines. It's therefore advisable to read through all of the details provided when evaluating opportunities tailored to your needs.
If you're looking beyond just salary numbers, keep an open mind when examining the available positions: while money is always important, things like more vacation days or flexible working hours may fit your personal priorities too. Ultimately it's up to you to decide which parameters work best when locating a suitable role via this dataset. Financials aside, making sure that any prospective employer meets your standards for the coding and database frameworks expected of employees also provides great peace of mind toward landing a successful, long-term position, so don't forget that detail while narrowing down your selections!
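As a first pass at such an evaluation, the quick sketch below loads the file and tabulates offers by location and employer; the exact column names are assumptions based on the five columns described above, so check them against jobs.columns first.

```python
import pandas as pd

jobs = pd.read_csv("Data Scientist.csv")
print(jobs.columns.tolist())  # verify the actual column names first

# Assumed column names based on the description above (location, company name).
print(jobs["location"].value_counts().head(10))
print(jobs["company name"].value_counts().head(10))
```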
- Analyzing the language preferences specified in data science job offers in EURAXESS to gain insight into the language requirements of the data science market across different European countries.
- Comparing salary averages between job postings within EURAXESS to identify potential discrepancies between wages paid for similar positions across countries or differences in job requirements at a given pay grade.
- Identifying trends in other special qualifications (e.g., degree, certification) required for data scientist roles within EURAXESS, compared to similar datasets from other regions such as North America, Asia, etc.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Data Scientist.csv
If you use this dataset in your research, please credit the original authors.
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Context & Motivation
This dataset provides a comprehensive, self-contained offline installer for the vllm library, a high-throughput engine for LLM inference. It is specifically designed to solve the common "no internet access" problem in Kaggle competitions like the ARC Prize, where packages must be installed from local files. Using this dataset eliminates pip install failures and ensures a consistent, reproducible environment for your submission notebook.
Content The dataset contains a single directory, vllm_wheels, which includes the Python wheel file for vllm==0.9.2 and all of its required dependencies. These files were downloaded and packaged in a standard Kaggle environment to ensure maximum compatibility with the competition's execution environment (Python 3.10, CUDA 12.x).
Usage To use this dataset in your Kaggle notebook (with internet turned OFF):
# --- vLLM Offline Installation ---
# Path to the directory containing the wheel files
WHEELS_PATH = "/kaggle/input/vllm-0-9-2-offline-installer/vllm_wheels"
print("Starting offline installation of vLLM...")
# Install from the local wheels only, without contacting the package index
!pip install --no-index --find-links={WHEELS_PATH} vllm
print("Installation complete.")
# Verify the installation
import vllm
print(f"vLLM version {vllm.__version__} successfully installed.")
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!
For more datasets, click here.
In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.
Once you have obtained new insights about healthcare from the answers provided in this dynamic dataset, it's time for action! Use that newfound understanding of patient needs to develop educational materials and implement any suggested changes. If more criteria are needed for querying this dataset, check whether MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis capabilities, so look out for notifications if that happens.
Finally, once you make an impact with your use case(s), don't forget proper citation etiquette; give credit where credit is due!
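If you prefer working in pandas rather than SQL, a minimal sketch equivalent to the SELECT example above would be:

```python
import pandas as pd

medquad = pd.read_csv("train.csv")   # columns: qtype, Question, Answer

# pandas equivalent of the SQL-style query mentioned above:
# SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%'
hits = medquad.loc[
    (medquad["qtype"] == "Treatment")
    & medquad["Question"].str.contains("pain", case=False, na=False),
    "Answer",
]
print(f"{len(hits)} matching answers")
print(hits.head())
```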
- Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
- Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
- Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
| Column name | Description |
|:------------|:--------------------------------------------------------|
| qtype | The type of medical question. (String) |
| Question | The medical question posed by the patient. (String) |
| Answer | The expert response to the medical question. (String) |
If you use this dataset in your research, please credit the original authors at Huggingface Hub.
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
This Grade School Math 8K Linguistically Diverse Training & Test Set is designed to help you develop and improve your understanding of multi-step reasoning question answering. The dataset contains three separate data files: the socratic_test.csv, main_test.csv, and main_train.csv, each containing a set of questions and answers related to grade school math that consists of multiple steps. Each file contains the same columns:
question,answer. The questions contained in this dataset are thoughtfully crafted to lead you through the reasoning journey for arriving at the correct answer each time, allowing you immense opportunities for learning through practice. With over 8 thousand entries for both training and testing purposes in this GSM8K dataset, it takes advanced multi-step reasoning skills to ace these questions! Deepen your knowledge today and master any challenge with ease using this amazing GSM8K set!
For more datasets, click here.
This dataset provides a unique opportunity to study multi-step reasoning for question answering. The GSM8K Linguistically Diverse Training & Test Set consists of 8,000 questions and answers that have been created to simulate real-world scenarios in grade school mathematics. Each question is paired with one answer based on a comprehensive test set. The questions cover topics such as algebra, arithmetic, probability and more.
The dataset consists of two main files: main_train.csv and main_test.csv; the former contains questions and answers specifically related to grade school math, while the latter includes multi-step reasoning tests for each category of the Ontario Math Curriculum (OMC). Each file has the columns question and answer, so every row holds a sequential question/answer pair, making it possible to follow a single path from the start of any given answer or branch out from there according to the logical construction each problem scenario requires. These columns can be used together with text-representation models such as ELMo or BERT to explore different representations for natural language processing tasks such as Q&A, or for building predictive models for numerical-data applications such as classifying resource-efficiency initiatives or forecasting sales volumes on retail platforms.
To use this dataset efficiently, first get familiar with its structure by reading the documentation, so you know the content and format of every field. Then study the examples that best suit your specific purpose, whether that is an education-research experiment, generating insights for marketing-analytics reports, or making predictions in an artificial-intelligence project. Learning the variable definitions before you begin keeps the work focused, prevents false starts caused by premature assumptions, and makes the rest of the research journey much smoother.
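A small hedged sketch of loading the training file and splitting out the final answer is shown below; the "####" separator is an assumption based on common GSM8K formatting, so verify it on a few rows of your copy.

```python
import pandas as pd

train = pd.read_csv("main_train.csv")   # columns: question, answer
print(train.loc[0, "question"])
print(train.loc[0, "answer"])

# GSM8K answers conventionally end with a line of the form "#### <final answer>";
# this separator is an assumption, so check it against the actual file.
def final_answer(answer: str) -> str:
    return answer.split("####")[-1].strip() if "####" in answer else answer

train["final_answer"] = train["answer"].map(final_answer)
print(train[["question", "final_answer"]].head())
```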
- Training language models for improving accuracy in natural language processing applications such as question answering or dialogue systems.
- Generating new grade school math questions and answers using g...
There are lots of datasets online, with more growing every day, to help us all get a handle on this pandemic. Here are just a few links to data we've found that students in ECE 657A, and anyone else who finds their way here, can play with and practice their machine learning skills on. The main dataset is the COVID-19 dataset from Johns Hopkins University. This data is perfect for time series analysis and Recurrent Neural Networks, the final topic in the course. This dataset will be left public so anyone can see it, but to join you must request the link from Prof. Crowley or be in the ECE 657A W20 course at the University of Waterloo.
Your bonus grade for assignment 4 comes from creating a kernel from this dataset, writing up some useful analysis, and publishing that notebook. You can do any kind of analysis you like, but some good places to start are:
- Analysis: feature extraction and analysis of the data to look for patterns that aren't evident from the original features (this is hard for the simple spread/infection/death data since there aren't that many features).
- Other Data: utilize any other datasets in your kernels by loading data about the countries themselves (population, density, wealth, etc.) or their responses to the situation. Tip: if you open a new notebook related to this dataset, you can easily add other data available on Kaggle and link that to your analysis.
- HOW'S MY FLATTENING COVID19 DATASET: this dataset has a lot more files and includes a lot of what I was talking about, so if you produce good kernels there you can also count them toward your asg4 grade. https://www.kaggle.com/howsmyflattening/covid19-challenges
- Predict: make predictions about confirmed cases, deaths, recoveries, or other metrics for the future. You can test your models by training on the past and predicting on the following days, then post a prediction for tomorrow or the next few days given ALL the data up to this point. Hopefully the datasets we've linked here will update automatically so your kernels update as well.
- Create Tasks: you can make your own "Tasks" as part of this Kaggle dataset and propose your own solution; then others can try solving them as well.
- Groups: students can do this assignment either in the same groups they had for assignment 3 or individually.
We're happy to add other relevant data to this Kaggle dataset. In particular, it would be great to integrate live data on the following:
- Progression of each country/region/city in "days since X level", such as days since 100 confirmed cases; see the link for a great example of such a dataset being plotted. I haven't seen a live link to a CSV of that data, but we could generate one.
- Mitigation policies enacted by local governments in each city/region/country: the dates when a region first enacted Level 1, 2, 3, or 4 containment, started encouraging social distancing, or closed different levels of schools, pubs, restaurants, etc.
- The hidden positives: a dataset, or a method for estimating, as described by Emtiyaz Khan in this Twitter thread. The idea is: how many unreported or unconfirmed cases are there in any region, and can we build an estimate of that number using other regions with widespread testing as a baseline, together with death rates, which are like an observation of a process with a hidden variable, the true infection rate.
- A paper discussing one way to compute this: https://cmmid.github.io/topics/covid19/severity/global_cfr_estimates.html
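As a tiny hedged example of the "Predict" idea (train on the past, predict the next few days), the sketch below fits a log-linear trend to a single region's cumulative confirmed cases; the CSV name and its date/confirmed columns are placeholders for a tidy extract of the Johns Hopkins data, which needs reshaping before it looks like this.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Placeholder file: one region, one row per date, columns "date" and "confirmed".
ts = pd.read_csv("covid_confirmed_canada.csv", parse_dates=["date"]).sort_values("date")

days = (ts["date"] - ts["date"].min()).dt.days.to_numpy().reshape(-1, 1)
y = np.log1p(ts["confirmed"].to_numpy())  # log scale for roughly exponential growth

# Train on everything except the last week, then "forecast" that held-out week.
model = LinearRegression().fit(days[:-7], y[:-7])
pred = np.expm1(model.predict(days[-7:]))

print(pd.DataFrame({
    "date": ts["date"].iloc[-7:].to_numpy(),
    "actual": ts["confirmed"].iloc[-7:].to_numpy(),
    "predicted": pred.round(),
}))
```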