https://creativecommons.org/publicdomain/zero/1.0/
The data consists of students' marks along with their study time and the number of courses taken. The dataset was downloaded from the UCI Machine Learning Repository.
Properties of the Dataset:
Number of Instances: 100
Number of Attributes: 3 including the target variable.
The project is simple yet challenging, as it has very limited features and samples. Can you build a regression model that captures the patterns in the dataset while maintaining the generalisability of the model?
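A minimal baseline sketch for this task; the file name and column names (time_study, number_courses, Marks) are assumptions based on the common Kaggle copy of this data:

```r
# Minimal linear-regression baseline; file and column names are assumptions.
df <- read.csv("Student_Marks.csv")
set.seed(42)
idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))  # 80/20 split
train <- df[idx, ]
test  <- df[-idx, ]
fit <- lm(Marks ~ time_study + number_courses, data = train)
pred <- predict(fit, newdata = test)
sqrt(mean((test$Marks - pred)^2))  # held-out RMSE as a generalisability check
```

With only two predictors and 100 rows, a linear fit plus a held-out RMSE is a reasonable first check before trying anything more flexible.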
**New to machine learning and data science? No question is too basic or too simple. Use this place to post any first-timer clarifying questions about the classification algorithm or the datasets!**

This file contains demographics about customers and whether each customer clicked the ad. Use this file with a classification algorithm to predict, from customer demographics as independent variables, whether a customer clicks the ad.
This data set contains the following features:
https://creativecommons.org/publicdomain/zero/1.0/
I recently found this dataset in a problem statement from Inter IIT Tech Meet 11.0, conducted by IIT Kanpur this year. For more info visit https://interiit-tech.org/, and for the problem statement see https://interiit-tech.org/images/ps/High_Devrev.pdf. The dataset is quite interesting and, beyond the problem statement, can be used in many different ways.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Loan Prediction Problem Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset on 28 January 2022.
--- No further description of dataset provided by original source ---
--- Original source retains full ownership of the source dataset ---
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable.
There are two files to help you get started with the dataset and evaluate your models:
The original datasets can be found here.
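A minimal sketch of reading the SQuAD JSON, assuming the standard release layout (data → paragraphs → qas) and the usual file name train-v2.0.json:

```r
# Load SQuAD and inspect the first question; the nested structure
# (data -> paragraphs -> qas) is assumed from the standard release format.
library(jsonlite)
squad <- fromJSON("train-v2.0.json", simplifyVector = FALSE)
para <- squad$data[[1]]$paragraphs[[1]]
qa   <- para$qas[[1]]
cat("Question:", qa$question, "\n")
cat("Unanswerable:", isTRUE(qa$is_impossible), "\n")  # v2.0 marks unanswerable questions
```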
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, with the continuous improvement of the financial system and the rapid development of the banking industry, competition within the banking industry has intensified. At the same time, with the rapid development of information and Internet technology, customers' choice of financial products is becoming more diversified, their dependence on and loyalty to banking institutions is declining, and the problem of customer churn in commercial banks is becoming more prominent. How to predict customer behavior and retain existing customers has become a major challenge for banks. Therefore, this study takes a bank's business data from the Kaggle platform as the research object, uses multiple sampling methods to balance the data, constructs a bank customer churn prediction model for churn identification with GA-XGBoost, and conducts an interpretability analysis of the GA-XGBoost model to provide decision support and suggestions for the banking industry to prevent customer churn. The results show that: (1) SMOTEENN is more effective than SMOTE and ADASYN in dealing with the imbalance of banking data. (2) The F1 and AUC values of the XGBoost model improved and optimized with a genetic algorithm reach 90% and 99%, respectively, which is optimal compared with six other machine learning models. The GA-XGBoost classifier was identified as the best solution for the customer churn problem. (3) Using Shapley values, we explain how each feature affects the model results and analyze the features that have a high impact on the model's predictions, such as the total number of transactions in the past year, the amount of transactions in the past year, the number of products owned by customers, and the total sales balance. The contribution of this paper is twofold: (1) the study extracts useful information from the black-box model based on the accurate identification of churned customers, which can help commercial banks improve their service quality and retain customers; (2) it can serve as a reference for customer churn early-warning models in other related industries, helping the banking industry maintain customer stability, maintain market position, and reduce corporate losses.
https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This dataset contains information on student engagement with Tableau, including quizzes, exams, and lessons. The data includes the course title, the rating of the course, the date the course was rated, the exam category, the exam duration, whether the answer was correct or not, the number of quizzes completed, the number of exams completed, the number of lessons completed, the date engaged, the exam result, and more.
The 'Student Engagement with Tableau' dataset offers insights into student engagement with the Tableau software. The data includes information on courses, exams, quizzes, and student learning.
This dataset can be used to examine how students use Tableau, what kind of engagement leads to better learning outcomes, and whether certain course or exam characteristics are associated with student engagement.
- Creating a heat map of student engagement by course and location
- Determining which courses are most popular among students from different countries
- Identifying patterns in students' exam results
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 365_course_info.csv

| Column name | Description |
|:---|:---|
| course_title | The title of the course. (String) |

File: 365_course_ratings.csv

| Column name | Description |
|:---|:---|
| course_rating | The rating given to the course by the student. (Numeric) |
| date_rated | The date on which the course was rated. (Date) |

File: 365_exam_info.csv

| Column name | Description |
|:---|:---|
| exam_category | The category of the exam. (Categorical) |
| exam_duration | The duration of the exam in minutes. (Numerical) |

File: 365_quiz_info.csv

| Column name | Description |
|:---|:---|
| answer_correct | Whether or not the student answered the question correctly. (Boolean) |

File: 365_student_engagement.csv

| Column name | Description |
|:---|:---|
| engagement_quizzes | The number of times a student has engaged with quizzes. (Numeric) |
| engagement_exams | The number of times a student has engaged with exams. (Numeric) |
| engagement_lessons | The number of times a student has engaged with lessons. (Numeric) |
| date_engaged | The date of the student's engagement. (Date) |

File: 365_student_exams.csv

| Column name | Description |
|:---|:---|
| exam_result | The result of the exam. (Categorical) |
| exam_completion_time | The time it took to complete the exam. (Numerical) |
| date_exam_completed | The date the exam was completed. (Date) |

File: 365_student_hub_questions.csv

| Column name | Description |
|:---|:---|
| date_question_asked | The date the question was asked. (Date) |

File: 365_student_info.csv

| Column name | Description |
|:---|:---|
| student_country | The country of the student. (Categorical) |
| date_registered | The date the student registered for the course. (Date) |

File: 365_student_learning.csv

| Column name | Description |
|:---|:---|
...
In a supply chain system, material backorder is a common problem that impacts an inventory system's service level and effectiveness. Identifying the parts most likely to run short before the shortage occurs presents a significant opportunity to improve overall company performance. In this project, we will train classifiers to predict future back-ordered products and generate predictions for a test set (a minimal sketch follows the column descriptions below).
Here we have two CSV files:
- Training_BOP.csv - the training set
- Testing_BOP.csv - the testing set
Each file has 23 columns and the last column (went_on_backorder) is the target column.
Column descriptions:
- sku - SKU code
- national_inv - Current inventory level of the component
- lead_time - Transit time
- in_transit_qtry - Quantity in transit
- forecast_x_month - Forecast sales for the next 3, 6, 9 months
- sales_x_month - Sales quantity for the prior 1, 3, 6, 9 months
- min_bank - Minimum recommended amount in stock
- potential_issue - Indicator variable noting a potential issue with the item
- pieces_past_due - Parts overdue from source
- perf_x_months_avg - Source performance in the last 6 and 12 months
- local_bo_qty - Amount of stock orders overdue
- X17-X22 - General risk flags
- went_on_backorder - Whether the product went on backorder (target)
- Validation - Indicator variable for training (0), validation (1), and test (2) set
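A minimal sketch of such a classifier, using randomForest as one reasonable choice; the median imputation of lead_time and the exact column handling are assumptions:

```r
# Baseline backorder classifier; column handling and imputation are assumptions.
library(randomForest)
train <- read.csv("Training_BOP.csv")
# Convert character columns (e.g., Yes/No flags) to factors for randomForest
train[] <- lapply(train, function(col) if (is.character(col)) as.factor(col) else col)
train$went_on_backorder <- as.factor(train$went_on_backorder)
# lead_time may contain missing values; impute with the median as a simple baseline
train$lead_time[is.na(train$lead_time)] <- median(train$lead_time, na.rm = TRUE)
# Exclude the identifier and the split indicator from the predictors
rf <- randomForest(went_on_backorder ~ . - sku - Validation, data = train, ntree = 200)
print(rf)  # OOB confusion matrix; expect heavy class imbalance on backorder data
```

Because backorders are rare, the OOB confusion matrix matters more than raw accuracy here; resampling or class weights would be natural next steps.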
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Titanic Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yasserh/titanic-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.).
This dataset is taken from Kaggle: https://www.kaggle.com/c/titanic/data.
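A minimal logistic-regression sketch, assuming the standard Kaggle train.csv columns (Survived, Pclass, Sex, Age, Fare):

```r
# Logistic-regression baseline for survival; column names assume the standard Kaggle file.
df <- read.csv("train.csv")
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)  # simple age imputation
fit <- glm(Survived ~ Pclass + Sex + Age + Fare, data = df, family = binomial)
summary(fit)  # Sex and Pclass typically carry most of the signal
pred <- as.integer(predict(fit, type = "response") > 0.5)
mean(pred == df$Survived)  # in-sample accuracy; use a holdout for an honest estimate
```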
--- Original source retains full ownership of the source dataset ---
This is the actual test-set prediction for the House Prices problem. Don't submit it to the leaderboard, because it will distort your standing. Use it to save time instead of submitting again and again.
Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website.
This is organized as three tables:
Datasets of all R questions and all Python questions are also available on Kaggle, but this dataset is especially useful for analyses that span many languages.
Example projects include:
All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required.
http://opendatacommons.org/licenses/dbcl/1.0/
About the Problem Statement:
Genre: NLP
Problem Type: Contextual Semantic Similarity, Auto-generated Text-based Answers
Submission Format:
- You need to generate up to 3 distractors for each Question-Answer combination.
- Each distractor is a string.
- The 3 distractors/strings need to be separated with a comma (,).
- Each value in Results.csv's distractor column will contain the distractors as follows: distractor_for_QnA_1 = "distractor1","distractor2","distractor3"
About the Evaluation Parameter:
- All distractor values for one question-answer pair are converted into vector form.
- One vector is generated for the submitted distractors and one for the ground-truth value.
- The cosine_similarity between these two vectors is evaluated.
- cosine_similarity is evaluated in the same way for every question-answer combination.
- Score of your submitted prediction file = mean(cosine_similarity between distractor vectors for each entry in test.csv).
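A sketch of the scoring logic described above. The text does not specify how distractor strings are vectorised, so placeholder numeric embeddings stand in here:

```r
# Cosine similarity between a submitted-distractor vector and the truth vector.
cosine_similarity <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

submitted <- c(0.12, 0.80, 0.31)  # placeholder embedding of the submitted distractors
truth     <- c(0.20, 0.75, 0.40)  # placeholder embedding of the ground-truth distractors
cosine_similarity(submitted, truth)
# Final score = mean of this value over every entry in test.csv
```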
Common Issues Faced and How to Handle Them:
Download Dataset giving XML error: Try restarting your session after clearing browser cache/cookies and try again. If you still face any issue, please raise a ticket with us.
Upload Prediction File not working: Ensure you are compliant with the Guidelines and FAQs. You will face this error if you exceed the maximum number of prediction file uploads allowed.
Exceptions (incorrect number of rows / incorrect headers / prediction missing for a key): For this problem statement, we recommend updating the 'distractor' column in Results.csv with your predictions, following the format explained above.
Evaluation is getting stuck in a loop: We recommend refreshing your session immediately and starting afresh with a cleared cache. Please ensure your predictions.csv matches the file format of Results.csv and that all the checks mentioned above have been conducted. If you still face any issue, please raise a ticket with us.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 110,204 admissions of 84,811 hospitalized subjects between 2011 and 2012 in Norway who were diagnosed with infections, systemic inflammatory response syndrome, sepsis by causative microbes, or septic shock. The prediction task is to determine whether a patient survived or died about 9 days after their medical record was collected at the hospital. This is an important prediction problem in clinical medicine. Sepsis is a life-threatening condition triggered by an immune overreaction to infection, leading to organ failure or even death. Sepsis carries an immediate death risk, often killing patients within one hour, which renders many laboratory tests and hospital analyses impractical for timely diagnosis and treatment. Being able to predict patient survival within minutes, using as few and as easy-to-retrieve medical features as possible, is therefore very important.
What do the instances in this dataset represent?
For the primary cohort, they represent records of patients affected by potential sepsis preconditions (under the pre-Sepsis-3 definition); for the study cohort, they represent only the admissions covered by the newer Sepsis-3 definition.
Are there recommended data splits?
No recommendation; a standard train-test split can be used. A three-way holdout split (i.e., training, validation/development, test) can be used when doing model selection, as sketched below.
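A sketch of the three-way holdout split; the 60/20/20 ratio is an assumption, and the file name is the primary-cohort file listed under Additional Information below:

```r
# 60/20/20 train/validation/test split; the ratio is an assumption.
df <- read.csv("s41598-020-73558-3_sepsis_survival_primary_cohort.csv")
set.seed(1)
n <- nrow(df)
shuffled <- sample(n)
train <- df[shuffled[1:floor(0.6 * n)], ]
valid <- df[shuffled[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test  <- df[shuffled[(floor(0.8 * n) + 1):n], ]
```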
Does the dataset contain data that might be considered sensitive in any way?
Yes. It contains information about the gender and age of the patient.
Was there any data preprocessing performed?
All the categorical variables have been encoded (so no preprocessing is necessary).
Additional Information
Primary cohort from Norway:
- 4 features for 110,204 patient admissions
- file: 's41598-020-73558-3_sepsis_survival_primary_cohort.csv'

Study cohort (a subset of the primary cohort) from Norway:
- 4 features for 19,051 patient admissions
- file: 's41598-020-73558-3_sepsis_survival_study_cohort.csv'

Validation cohort from South Korea:
- 4 features for 137 patients
- file: 's41598-020-73558-3_sepsis_survival_validation_cohort.csv'
The validation cohort from South Korea was used by Chicco and Jurman (2020) as an external validation cohort to confirm the generalizability of their proposed approach.
Overview

Foreseeing bugs, features, and questions on GitHub can be fun, especially when one is provided with a colossal dataset of GitHub issues. In this hackathon, we are challenging the MachineHack community to come up with an algorithm that can predict bugs, features, and questions based on GitHub issue titles and text bodies. Text data brings plenty of challenges, especially when the dataset is big: analyzing it requires substantial preprocessing to represent raw text in a machine-understandable form. Usually, we stem and lemmatize the raw text and then represent it using TF-IDF, word embeddings, etc.
However, with state-of-the-art NLP models such as Transformer-based BERT models, one can skip manual feature engineering like TF-IDF and Count Vectorizers. In this short span of time, we encourage you to leverage the ImageNet moment of NLP (transfer learning) using various pre-trained models.
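As a contrast to the transformer route, a minimal TF-IDF baseline sketch using the tm package; reading the JSON via jsonlite and a flat Title/Body/Label layout are assumptions:

```r
# TF-IDF representation of issue titles and bodies; a baseline before transformers.
library(jsonlite)
library(tm)
train <- fromJSON("Train.json")  # assumed flat records with Title/Body/Label fields
texts <- paste(train$Title, train$Body)
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus, control = list(
  weighting = weightTfIdf,
  tolower = TRUE,
  removePunctuation = TRUE,
  stopwords = TRUE
))
dim(dtm)  # documents x terms; feed this matrix to any classifier with train$Label as target
```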
This hackathon also offers an interesting learning curve for machine learning specialists: to win the prizes you must write quality code, as the evaluation includes a code quality score from the Embold Code Analysis platform here.
Every participant has to register on Embold's platform for free as a mandatory step before proceeding with the hackathon.
Here is a quick tour of how to use Embold's Code Analysis Platform for free!
Dataset Description:
- Train.json - 150,000 rows x 3 columns (includes the Label column as the target variable)
- Test.json - 30,000 rows x 2 columns
- Train_extra.json - 300,000 rows x 3 columns (includes the Label column as the target variable); provided solely for training purposes and can be appended to Train.json for training the model
- Sample Submission.csv - please check the Evaluation section for more details on how to generate a valid submission
Attribute Description:
- Title - the title of the GitHub bug, feature, or question
- Body - the body of the GitHub bug, feature, or question
- Label - represents the class: Bug - 0, Feature - 1, Question - 2

Skills:
- Natural Language Processing
- Feature extraction from raw text using TF-IDF, CountVectorizer
- Using word embeddings to represent words as vectors
- Using pretrained models like Transformers, BERT
- Optimizing accuracy score as a metric to generalize well on unseen data
This dataset was created by AbdelrahmanTarek22
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Imagine you have to process bug reports about an application your company is developing, which is available for both Android and iOS. Could you find a way to automatically classify them so you can send them to the right support team?
The dataset contains data from two StackExchange forums: Android Enthusiasts and Ask Different (Apple). I pre-processed both datasets from the raw XML files retrieved from the Internet Archive so that they contain only information useful for building machine learning classifiers. In the case of the Apple forum, I narrowed the data down to the subset of questions tagged "iOS", "iPhone", or "iPad".
Think of this as a fun way to learn to build ML classifiers! The training, validation, and test sets are all available, but in order to build robust models, please use the test set as little as possible (only as a final validation of your models).
The image was retrieved from unsplash and made by @thenewmalcolm. Link to image here.
The data was made available for free under a CC-BY-SA 4.0 license by StackExchange and hosted by Internet Archive. Find it here.
This dataset was created by Shaimaa Othman
https://creativecommons.org/publicdomain/zero/1.0/
9. Plot the decision tree
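A sketch of this step with rpart/rpart.plot; the original tooling isn't shown, and a train data frame with a Churn factor column is assumed:

```r
# Fit and plot a classification tree for churn; data frame and column names assumed.
library(rpart)
library(rpart.plot)
tree <- rpart(Churn ~ ., data = train, method = "class")
rpart.plot(tree)  # each leaf shows the predicted class and the churn proportion
```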
Average customer churn is 27%. Churn can take place when tenure is >= 7.5 and there is no internet service.
The most significant variables are Internet Service and Tenure; the least significant are Streaming Movies and Tech Support.
Run library(randomForest). Here we are using the default ntree (500) and mtry (floor(sqrt(p)) for classification, where p is the number of independent variables).
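A sketch of this step, using the same assumed train data frame as above:

```r
# Random forest with package defaults: ntree = 500 and, for classification,
# mtry = floor(sqrt(p)) where p is the number of independent variables.
library(randomForest)
set.seed(123)
rf <- randomForest(Churn ~ ., data = train, ntree = 500)
print(rf)  # OOB error rate and confusion matrix
```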
From the confusion matrix, accuracy comes to 79.27%, marginally higher than the decision tree's 79.00%. The error rate is quite low when predicting "No" and much higher when predicting "Yes".
Plot the model to show which variables reduce the Gini impurity the most and the least. Total charges and tenure reduce the Gini impurity the most, while phone service has the least impact.
Tune the model: mtry = 2 has the lowest OOB error rate.
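A sketch of the tuning step with tuneRF; the stepFactor/improve values are assumptions:

```r
# Search over mtry by OOB error; stepFactor/improve values are assumptions.
x <- train[, setdiff(names(train), "Churn")]
y <- train$Churn
tuned <- tuneRF(x, y, ntreeTry = 200, stepFactor = 1.5, improve = 0.01)
tuned  # matrix of mtry vs OOB error; mtry = 2 comes out lowest here
```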
Use random forest with mtry = 2 and ntree = 200
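A sketch of the final fit with the tuned parameters:

```r
# Final model with tuned mtry and a smaller forest.
rf_final <- randomForest(Churn ~ ., data = train, mtry = 2, ntree = 200)
print(rf_final)  # confusion matrix behind the 79.71% accuracy reported below
```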
From the confusion matrix, accuracy comes to 79.71%, marginally higher than the default model's 79.27% (when ntree was 500 and mtry was 4) and the decision tree's 79.00%. The error rate is quite low when predicting "No" and much higher when predicting "Yes".