https://creativecommons.org/publicdomain/zero/1.0/
The data consists of students' marks along with their study time and the number of courses taken. The dataset was downloaded from the UCI Machine Learning Repository.
Properties of the Dataset:
Number of Instances: 100
Number of Attributes: 3 including the target variable.
The project is simple yet challenging, as it has very limited features and samples. Can you build a regression model that captures the patterns in the dataset while maintaining the generalisability of the model?
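A minimal baseline sketch for this task; the file name and column names (time_study, number_courses, Marks) are assumptions based on the common Kaggle copy of this data:

```r
# Minimal linear-regression baseline; file and column names are assumptions.
df <- read.csv("Student_Marks.csv")
set.seed(42)
idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))  # 80/20 split
train <- df[idx, ]
test  <- df[-idx, ]
fit <- lm(Marks ~ time_study + number_courses, data = train)
pred <- predict(fit, newdata = test)
sqrt(mean((test$Marks - pred)^2))  # held-out RMSE as a generalisability check
```

With only two predictors and 100 rows, a linear fit plus a held-out RMSE is a reasonable first check before trying anything more flexible.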
**New to machine learning and data science? No question is too basic or too simple. Use this place to post any first-timer clarifying questions about the classification algorithm or the datasets!**

This file contains demographics about customers and whether each customer clicked the ad. Use this file with a classification algorithm to predict, from customer demographics as independent variables, whether a customer clicks the ad.
This data set contains the following features:
https://creativecommons.org/publicdomain/zero/1.0/
I recently found this dataset in a problem statement from Inter IIT Tech Meet 11.0, conducted by IIT Kanpur this year. For more info visit https://interiit-tech.org/, and for the problem statement see https://interiit-tech.org/images/ps/High_Devrev.pdf. The dataset is quite interesting and, beyond the problem statement, can be used in many different ways.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Loan Prediction Problem Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset on 28 January 2022.
--- No further description of dataset provided by original source ---
--- Original source retains full ownership of the source dataset ---
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable.
There are two files to help you get started with the dataset and evaluate your models:
The original datasets can be found here.
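A minimal sketch of reading the SQuAD JSON, assuming the standard release layout (data → paragraphs → qas) and the usual file name train-v2.0.json:

```r
# Load SQuAD and inspect the first question; the nested structure
# (data -> paragraphs -> qas) is assumed from the standard release format.
library(jsonlite)
squad <- fromJSON("train-v2.0.json", simplifyVector = FALSE)
para <- squad$data[[1]]$paragraphs[[1]]
qa   <- para$qas[[1]]
cat("Question:", qa$question, "\n")
cat("Unanswerable:", isTRUE(qa$is_impossible), "\n")  # v2.0 marks unanswerable questions
```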
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, with the continuous improvement of the financial system and the rapid development of the banking industry, competition within the banking industry has intensified. At the same time, with the rapid development of information and Internet technology, customers' choice of financial products is becoming more diversified, their dependence on and loyalty to banking institutions is declining, and the problem of customer churn in commercial banks is becoming more prominent. How to predict customer behavior and retain existing customers has become a major challenge for banks. Therefore, this study takes a bank's business data from the Kaggle platform as the research object, uses multiple sampling methods to balance the data, constructs a bank customer churn prediction model for churn identification with GA-XGBoost, and conducts an interpretability analysis of the GA-XGBoost model to provide decision support and suggestions for the banking industry to prevent customer churn. The results show that: (1) SMOTEENN is more effective than SMOTE and ADASYN in dealing with the imbalance of banking data. (2) The F1 and AUC values of the XGBoost model improved and optimized with a genetic algorithm reach 90% and 99%, respectively, which is optimal compared with six other machine learning models. The GA-XGBoost classifier was identified as the best solution for the customer churn problem. (3) Using Shapley values, we explain how each feature affects the model results and analyze the features that have a high impact on the model's predictions, such as the total number of transactions in the past year, the amount of transactions in the past year, the number of products owned by customers, and the total sales balance. The contribution of this paper is twofold: (1) the study extracts useful information from the black-box model based on the accurate identification of churned customers, which can help commercial banks improve their service quality and retain customers; (2) it can serve as a reference for customer churn early-warning models in other related industries, helping the banking industry maintain customer stability, maintain market position, and reduce corporate losses.
https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset contains information on student engagement with Tableau, including quizzes, exams, and lessons. The data includes the course title, the rating of the course, the date the course was rated, the exam category, the exam duration, whether the answer was correct or not, the number of quizzes completed, the number of exams completed, the number of lessons completed, the date engaged, the exam result, and more.
The 'Student Engagement with Tableau' dataset offers insights into student engagement with the Tableau software. The data includes information on courses, exams, quizzes, and student learning.
This dataset can be used to examine how students use Tableau, what kind of engagement leads to better learning outcomes, and whether certain course or exam characteristics are associated with student engagement.
- Creating a heat map of student engagement by course and location
- Determining which courses are most popular among students from different countries
- Identifying patterns in students' exam results
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 365_course_info.csv

| Column name | Description |
|:---|:---|
| course_title | The title of the course. (String) |

File: 365_course_ratings.csv

| Column name | Description |
|:---|:---|
| course_rating | The rating given to the course by the student. (Numeric) |
| date_rated | The date on which the course was rated. (Date) |

File: 365_exam_info.csv

| Column name | Description |
|:---|:---|
| exam_category | The category of the exam. (Categorical) |
| exam_duration | The duration of the exam in minutes. (Numerical) |

File: 365_quiz_info.csv

| Column name | Description |
|:---|:---|
| answer_correct | Whether or not the student answered the question correctly. (Boolean) |

File: 365_student_engagement.csv

| Column name | Description |
|:---|:---|
| engagement_quizzes | The number of times a student has engaged with quizzes. (Numeric) |
| engagement_exams | The number of times a student has engaged with exams. (Numeric) |
| engagement_lessons | The number of times a student has engaged with lessons. (Numeric) |
| date_engaged | The date of the student's engagement. (Date) |

File: 365_student_exams.csv

| Column name | Description |
|:---|:---|
| exam_result | The result of the exam. (Categorical) |
| exam_completion_time | The time it took to complete the exam. (Numerical) |
| date_exam_completed | The date the exam was completed. (Date) |

File: 365_student_hub_questions.csv

| Column name | Description |
|:---|:---|
| date_question_asked | The date the question was asked. (Date) |

File: 365_student_info.csv

| Column name | Description |
|:---|:---|
| student_country | The country of the student. (Categorical) |
| date_registered | The date the student registered for the course. (Date) |

File: 365_student_learning.csv

| Column name | Description |
|:---|:---|
...
In a supply chain system, material backorder is a common problem that impacts an inventory system's service level and effectiveness. Identifying the parts most likely to run short before the shortage occurs presents a significant opportunity to improve overall company performance. In this project, we will train classifiers to predict future back-ordered products and generate predictions for a test set (a minimal sketch follows the column descriptions below).
Here we have two CSV files:
- Training_BOP.csv - the training set
- Testing_BOP.csv - the testing set
Each file has 23 columns and the last column (went_on_backorder) is the target column.
Column descriptions:
- sku - SKU code
- national_inv - Current inventory level of the component
- lead_time - Transit time
- in_transit_qtry - Quantity in transit
- forecast_x_month - Forecast sales for the next 3, 6, 9 months
- sales_x_month - Sales quantity for the prior 1, 3, 6, 9 months
- min_bank - Minimum recommended amount in stock
- potential_issue - Indicator variable noting a potential issue with the item
- pieces_past_due - Parts overdue from source
- perf_x_months_avg - Source performance in the last 6 and 12 months
- local_bo_qty - Amount of stock orders overdue
- X17-X22 - General risk flags
- went_on_backorder - Whether the product went on backorder (target)
- Validation - Indicator variable for training (0), validation (1), and test (2) set
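A minimal sketch of such a classifier, using randomForest as one reasonable choice; the median imputation of lead_time and the exact column handling are assumptions:

```r
# Baseline backorder classifier; column handling and imputation are assumptions.
library(randomForest)
train <- read.csv("Training_BOP.csv")
# Convert character columns (e.g., Yes/No flags) to factors for randomForest
train[] <- lapply(train, function(col) if (is.character(col)) as.factor(col) else col)
train$went_on_backorder <- as.factor(train$went_on_backorder)
# lead_time may contain missing values; impute with the median as a simple baseline
train$lead_time[is.na(train$lead_time)] <- median(train$lead_time, na.rm = TRUE)
# Exclude the identifier and the split indicator from the predictors
rf <- randomForest(went_on_backorder ~ . - sku - Validation, data = train, ntree = 200)
print(rf)  # OOB confusion matrix; expect heavy class imbalance on backorder data
```

Because backorders are rare, the OOB confusion matrix matters more than raw accuracy here; resampling or class weights would be natural next steps.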
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Titanic Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yasserh/titanic-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.).
This dataset is taken from Kaggle: https://www.kaggle.com/c/titanic/data.
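A minimal logistic-regression sketch, assuming the standard Kaggle train.csv columns (Survived, Pclass, Sex, Age, Fare):

```r
# Logistic-regression baseline for survival; column names assume the standard Kaggle file.
df <- read.csv("train.csv")
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)  # simple age imputation
fit <- glm(Survived ~ Pclass + Sex + Age + Fare, data = df, family = binomial)
summary(fit)  # Sex and Pclass typically carry most of the signal
pred <- as.integer(predict(fit, type = "response") > 0.5)
mean(pred == df$Survived)  # in-sample accuracy; use a holdout for an honest estimate
```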
--- Original source retains full ownership of the source dataset ---
This is the actual test-set prediction for the House Prices problem. Don't submit it to the leaderboard, because it will distort your standing. Use it to save time instead of submitting again and again.
Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website.
This is organized as three tables:
Datasets of all R questions and all Python questions are also available on Kaggle, but this dataset is especially useful for analyses that span many languages.
Example projects include:
All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required.
http://opendatacommons.org/licenses/dbcl/1.0/
About the Problem Statement:
Genre: NLP
Problem Type: Contextual Semantic Similarity, Auto-generated Text-based Answers
Submission Format:
- You need to generate up to 3 distractors for each Question-Answer combination.
- Each distractor is a string.
- The 3 distractors/strings need to be separated with a comma (,).
- Each value in Results.csv's distractor column will contain the distractors as follows: distractor_for_QnA_1 = "distractor1","distractor2","distractor3"
About the Evaluation Parameter:
- All distractor values for one question-answer pair are converted into vector form.
- One vector is generated for the submitted distractors and one for the ground-truth value.
- The cosine_similarity between these two vectors is evaluated.
- cosine_similarity is evaluated in the same way for every question-answer combination.
- Score of your submitted prediction file = mean(cosine_similarity between distractor vectors for each entry in test.csv).
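A sketch of the scoring logic described above. The text does not specify how distractor strings are vectorised, so placeholder numeric embeddings stand in here:

```r
# Cosine similarity between a submitted-distractor vector and the truth vector.
cosine_similarity <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

submitted <- c(0.12, 0.80, 0.31)  # placeholder embedding of the submitted distractors
truth     <- c(0.20, 0.75, 0.40)  # placeholder embedding of the ground-truth distractors
cosine_similarity(submitted, truth)
# Final score = mean of this value over every entry in test.csv
```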
Common Issues Faced and How to Handle Them:
Download Dataset giving XML error: Try restarting your session after clearing browser cache/cookies and try again. If you still face any issue, please raise a ticket with us.
Upload Prediction File not working: Ensure you are compliant with the Guidelines and FAQs. You will face this error if you exceed the maximum number of prediction file uploads allowed.
Exceptions (incorrect number of rows / incorrect headers / prediction missing for a key): For this problem statement, we recommend updating the 'distractor' column in Results.csv with your predictions, following the format explained above.
Evaluation is getting stuck in a loop: We recommend refreshing your session immediately and starting afresh with a cleared cache. Please ensure your predictions.csv matches the file format of Results.csv and that all the checks mentioned above have been conducted. If you still face any issue, please raise a ticket with us.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 110,204 admissions of 84,811 hospitalized subjects between 2011 and 2012 in Norway who were diagnosed with infections, systemic inflammatory response syndrome, sepsis by causative microbes, or septic shock. The prediction task is to determine whether a patient survived or died about 9 days after their medical record was collected at the hospital. This is an important prediction problem in clinical medicine. Sepsis is a life-threatening condition triggered by an immune overreaction to infection, leading to organ failure or even death. Sepsis carries an immediate death risk, often killing patients within one hour, which renders many laboratory tests and hospital analyses impractical for timely diagnosis and treatment. Being able to predict patient survival within minutes, using as few and as easy-to-retrieve medical features as possible, is therefore very important.
What do the instances in this dataset represent?
For the primary cohort, they represent records of patients affected by potential sepsis preconditions (under the pre-Sepsis-3 definition); for the study cohort, they represent only the admissions covered by the newer Sepsis-3 definition.
Are there recommended data splits?
No recommendation; a standard train-test split can be used. A three-way holdout split (i.e., training, validation/development, test) can be used when doing model selection, as sketched below.
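A sketch of the three-way holdout split; the 60/20/20 ratio is an assumption, and the file name is the primary-cohort file listed under Additional Information below:

```r
# 60/20/20 train/validation/test split; the ratio is an assumption.
df <- read.csv("s41598-020-73558-3_sepsis_survival_primary_cohort.csv")
set.seed(1)
n <- nrow(df)
shuffled <- sample(n)
train <- df[shuffled[1:floor(0.6 * n)], ]
valid <- df[shuffled[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test  <- df[shuffled[(floor(0.8 * n) + 1):n], ]
```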
Does the dataset contain data that might be considered sensitive in any way?
Yes. It contains information about the gender and age of the patient.
Was there any data preprocessing performed?
All the categorical variables have been encoded (so no preprocessing is necessary).
Additional Information
Primary cohort from Norway:
- 4 features for 110,204 patient admissions
- file: 's41598-020-73558-3_sepsis_survival_primary_cohort.csv'

Study cohort (a subset of the primary cohort) from Norway:
- 4 features for 19,051 patient admissions
- file: 's41598-020-73558-3_sepsis_survival_study_cohort.csv'

Validation cohort from South Korea:
- 4 features for 137 patients
- file: 's41598-020-73558-3_sepsis_survival_validation_cohort.csv'
The validation cohort from South Korea was used by Chicco and Jurman (2020) as an external validation cohort to confirm the generalizability of their proposed approach.
Overview

Foreseeing bugs, features, and questions on GitHub can be fun, especially when one is provided with a colossal dataset of GitHub issues. In this hackathon, we are challenging the MachineHack community to come up with an algorithm that can predict bugs, features, and questions based on GitHub issue titles and text bodies. Text data brings plenty of challenges, especially when the dataset is big: analyzing it requires substantial preprocessing to represent raw text in a machine-understandable form. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, word embeddings, etc.
However, with state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering like TF-IDF and Count Vectorizers. In this short span of time, we encourage you to leverage the ImageNet moment of NLP (transfer learning) using various pre-trained models.
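As a contrast to the transformer route, a minimal TF-IDF baseline sketch using the tm package; reading the JSON via jsonlite and a flat Title/Body/Label layout are assumptions:

```r
# TF-IDF representation of issue titles and bodies; a baseline before transformers.
library(jsonlite)
library(tm)
train <- fromJSON("Train.json")  # assumed flat records with Title/Body/Label fields
texts <- paste(train$Title, train$Body)
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus, control = list(
  weighting = weightTfIdf,
  tolower = TRUE,
  removePunctuation = TRUE,
  stopwords = TRUE
))
dim(dtm)  # documents x terms; feed this matrix to any classifier with train$Label as target
```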
This hackathon also offers an interesting learning curve for machine learning specialists: to win the prizes you must write quality code, as the evaluation includes a code quality score from the Embold Code Analysis platform here.
Every participant has to register on Embold's platform for free as a mandatory step before proceeding with the hackathon.
Here is a quick tour of how to use Embold's Code Analysis Platform for free!
Dataset Description:
- Train.json - 150,000 rows x 3 columns (includes the Label column as the target variable)
- Test.json - 30,000 rows x 2 columns
- Train_extra.json - 300,000 rows x 3 columns (includes the Label column as the target variable); provided solely for training purposes and can be appended to Train.json for training the model
- Sample Submission.csv - please check the Evaluation section for more details on how to generate a valid submission
Attribute Description:
- Title - the title of the GitHub bug, feature, or question
- Body - the body of the GitHub bug, feature, or question
- Label - represents the class: Bug - 0, Feature - 1, Question - 2

Skills:
- Natural Language Processing
- Feature extraction from raw text using TF-IDF, CountVectorizer
- Using word embeddings to represent words as vectors
- Using pretrained models like Transformers, BERT
- Optimizing accuracy score as a metric to generalize well on unseen data
This dataset was created by AbdelrahmanTarek22
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Imagine you have to process bug reports about an application your company is developing, which is available for both Android and iOS. Could you find a way to automatically classify them so you can send them to the right support team?
The dataset contains data from two StackExchange forums: Android Enthusiasts and Ask Different (Apple). I pre-processed both datasets from the raw XML files retrieved from the Internet Archive so that they contain only information useful for building machine learning classifiers. In the case of the Apple forum, I narrowed the data down to the subset of questions tagged "iOS", "iPhone", or "iPad".
Think of this as a fun way to learn to build ML classifiers! The training, validation, and test sets are all available, but in order to build robust models, please use the test set as little as possible (only as a final validation of your models).
The image was retrieved from unsplash and made by @thenewmalcolm. Link to image here.
The data was made available for free under a CC-BY-SA 4.0 license by StackExchange and hosted by Internet Archive. Find it here.
This dataset was created by Shaimaa Othman
https://creativecommons.org/publicdomain/zero/1.0/
9. Plot the decision tree
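A sketch of this step with rpart/rpart.plot; the original tooling isn't shown, and a train data frame with a Churn factor column is assumed:

```r
# Fit and plot a classification tree for churn; data frame and column names assumed.
library(rpart)
library(rpart.plot)
tree <- rpart(Churn ~ ., data = train, method = "class")
rpart.plot(tree)  # each leaf shows the predicted class and the churn proportion
```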
Average customer churn is 27%. Churn can take place when tenure is >= 7.5 and there is no internet service.
The most significant variables are Internet Service and Tenure; the least significant are Streaming Movies and Tech Support.
Run library(randomForest). Here we are using the default ntree (500) and mtry (floor(sqrt(p)) for classification, where p is the number of independent variables).
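A sketch of this step, using the same assumed train data frame as above:

```r
# Random forest with package defaults: ntree = 500 and, for classification,
# mtry = floor(sqrt(p)) where p is the number of independent variables.
library(randomForest)
set.seed(123)
rf <- randomForest(Churn ~ ., data = train, ntree = 500)
print(rf)  # OOB error rate and confusion matrix
```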
From the confusion matrix, accuracy comes to 79.27%, marginally higher than the decision tree's 79.00%. The error rate is quite low when predicting "No" and much higher when predicting "Yes".
Plot the model to show which variables reduce the Gini impurity the most and the least. Total charges and tenure reduce the Gini impurity the most, while phone service has the least impact.
Tune the model: mtry = 2 has the lowest OOB error rate.
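A sketch of the tuning step with tuneRF; the stepFactor/improve values are assumptions:

```r
# Search over mtry by OOB error; stepFactor/improve values are assumptions.
x <- train[, setdiff(names(train), "Churn")]
y <- train$Churn
tuned <- tuneRF(x, y, ntreeTry = 200, stepFactor = 1.5, improve = 0.01)
tuned  # matrix of mtry vs OOB error; mtry = 2 comes out lowest here
```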
Use random forest with mtry = 2 and ntree = 200
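A sketch of the final fit with the tuned parameters:

```r
# Final model with tuned mtry and a smaller forest.
rf_final <- randomForest(Churn ~ ., data = train, mtry = 2, ntree = 200)
print(rf_final)  # confusion matrix behind the 79.71% accuracy reported below
```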
From the confusion matrix, accuracy comes to 79.71%, marginally higher than the default model's 79.27% (when ntree was 500 and mtry was 4) and the decision tree's 79.00%. The error rate is quite low when predicting "No" and much higher when predicting "Yes".