37 datasets found
  1. Student Marks Dataset

    • kaggle.com
    Updated Jan 4, 2022
    Cite
    M Yasser H (2022). Student Marks Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/student-marks-dataset/code
    Dataset updated
    Jan 4, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description:

    The data consists of student marks together with each student's study time and number of courses. The dataset was downloaded from the UCI Machine Learning Repository.

    Properties of the Dataset:
    Number of Instances: 100
    Number of Attributes: 3 including the target variable.

    The project is simple yet challenging, as it has very limited features and samples. Can you build a regression model that captures all the patterns in the dataset while maintaining the generalisability of the model?

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build Regression models to predict the student marks wrt multiple features.
    • Also evaluate the models & compare their respective scores like R2, RMSE, etc.
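
The evaluation step in the objective above can be sketched in plain Python. This is only an illustration of how R2 and RMSE are computed and compared; the marks and the two sets of predictions are hypothetical values, not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed marks and predictions from two candidate models.
marks = [20.0, 35.0, 50.0, 65.0]
model_a = [22.0, 33.0, 52.0, 63.0]
model_b = [30.0, 30.0, 55.0, 55.0]

for name, predictions in (("model_a", model_a), ("model_b", model_b)):
    print(name, "R2:", r2_score(marks, predictions), "RMSE:", rmse(marks, predictions))
```

A higher R2 and a lower RMSE both point to the better fit. With only 100 instances, it is probably wise to compute these scores on held-out data rather than on the training rows.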
  2. Ad Click Prediction - Classification Problem

    • kaggle.com
    Updated Jul 4, 2021
    Cite
    Jahanvee Narang (2021). Ad Click Prediction - Classification Problem [Dataset]. https://www.kaggle.com/jahnveenarang/cvdcvd-vd/code
    Dataset updated
    Jul 4, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jahanvee Narang
    Description

    New to machine learning and data science? No question is too basic or too simple. Use this place to post any first-timer clarifying questions about the classification algorithm or the datasets.

    This file contains customer demographics and whether each customer clicked the ad. Use it to train a classification algorithm that predicts ad clicks, with the customer demographics as independent variables.

    This data set contains the following features:

    1. 'User ID': unique identifier for the consumer
    2. 'Age': customer age in years
    3. 'Estimated Salary': average income of the consumer
    4. 'Gender': whether the consumer was male or female
    5. 'Purchased': 0 or 1, indicating whether the consumer clicked on the ad
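
As a rough sketch of how the features listed above might be prepared for a classifier, the snippet below parses a few rows into numeric feature vectors and labels. The sample rows, the column order, and the assumption that the data arrives as a CSV file are all hypothetical; only the column names come from the listing.

```python
import csv
import io

# Hypothetical sample rows; only the column names are taken from the listing.
SAMPLE_CSV = """User ID,Gender,Age,Estimated Salary,Purchased
1,Male,19,19000,0
2,Female,35,20000,0
3,Female,46,41000,1
"""


def load_features(csv_text):
    """Return (features, labels). 'User ID' is dropped because it only identifies a row."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Encode gender as a 0/1 flag so every feature is numeric.
        gender_flag = 1.0 if row["Gender"] == "Female" else 0.0
        features.append([float(row["Age"]), float(row["Estimated Salary"]), gender_flag])
        labels.append(int(row["Purchased"]))
    return features, labels


X, y = load_features(SAMPLE_CSV)
print(X)
print(y)
```

Because age and salary are on very different scales, scaling the features before fitting a distance-based or gradient-based classifier would likely help.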
  3. Answer Prediction Dataset

    • kaggle.com
    Updated Dec 23, 2022
    Cite
    Aman Kumar (2022). Answer Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/a2m2a2n2/question-answer-dataset
    Dataset updated
    Dec 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Kumar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    I recently found this dataset in a problem statement of Inter IIT Tech Meet 11.0, being conducted by IIT Kanpur this year. For more information, see https://interiit-tech.org/ (event site) and https://interiit-tech.org/images/ps/High_Devrev.pdf (problem statement). The dataset is quite interesting and, apart from the problem statement, can be used in many different ways.

  4. ‘Loan Prediction Problem Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Loan Prediction Problem Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-loan-prediction-problem-dataset-e270/latest
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Loan Prediction Problem Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/altruistdelhite04/loan-prediction-problem-dataset on 28 January 2022.

    --- No further description of dataset provided by original source ---

    --- Original source retains full ownership of the source dataset ---

  5. The Stanford Question Answering Dataset

    • kaggle.com
    zip
    Updated Nov 25, 2020
    + more versions
    Cite
    Le Viet Thang (2020). The Stanford Question Answering Dataset [Dataset]. https://www.kaggle.com/toreleon/squad-20-the-stanford-question-answering-dataset
    Available download formats: zip (10281338 bytes)
    Dataset updated
    Nov 25, 2020
    Authors
    Le Viet Thang
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

    Content

    There are two files to help you get started with the dataset and evaluate your models:

    • train-v2.0.json
    • dev-v2.0.json
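
A minimal sketch of reading the SQuAD v2.0 JSON layout follows. It walks the nested `data` → `paragraphs` → `qas` structure and flags unanswerable questions through the `is_impossible` field. The toy document is a hypothetical stand-in for the contents of train-v2.0.json or dev-v2.0.json.

```python
import json

# Hypothetical toy document in the SQuAD v2.0 layout.
TOY_JSON = """
{"version": "v2.0", "data": [{"title": "Example", "paragraphs": [{
  "context": "The Nile flows through Egypt.",
  "qas": [
    {"id": "q1", "question": "Which country does the Nile flow through?",
     "answers": [{"text": "Egypt", "answer_start": 23}], "is_impossible": false},
    {"id": "q2", "question": "Who discovered the Nile?",
     "answers": [], "is_impossible": true}
  ]}]}]}
"""


def iter_squad_examples(squad_dict):
    """Yield one flat record per question from the nested SQuAD structure."""
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": paragraph["context"],
                    "answers": [a["text"] for a in qa.get("answers", [])],
                    "unanswerable": qa.get("is_impossible", False),
                }


examples = list(iter_squad_examples(json.loads(TOY_JSON)))
for example in examples:
    print(example["id"], example["answers"], example["unanswerable"])
```

For the real files, `json.load` on an opened file handle would replace `json.loads(TOY_JSON)`.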

    Acknowledgements

    The original datasets can be found here.

    Inspiration

    • Can you build a prediction model that can accurately predict answers to different types of questions?
    • You can also explore SQuAD here
  6. Comparison results of different model.

    • plos.figshare.com
    xls
    Updated Dec 8, 2023
    + more versions
    Cite
    Ke Peng; Yan Peng; Wenguang Li (2023). Comparison results of different model. [Dataset]. http://doi.org/10.1371/journal.pone.0289724.t006
    Available download formats: xls
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ke Peng; Yan Peng; Wenguang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, with the continuous improvement of the financial system and the rapid development of the banking industry, the competition of the banking industry itself has intensified. At the same time, with the rapid development of information technology and Internet technology, customers’ choice of financial products is becoming more and more diversified, and customers’ dependence and loyalty to banking institutions is becoming less and less, and the problem of customer churn in commercial banks is becoming more and more prominent. How to predict customer behavior and retain existing customers has become a major challenge for banks to solve. Therefore, this study takes a bank’s business data on Kaggle platform as the research object, uses multiple sampling methods to compare the data for balancing, constructs a bank customer churn prediction model for churn identification by GA-XGBoost, and conducts interpretability analysis on the GA-XGBoost model to provide decision support and suggestions for the banking industry to prevent customer churn. The results show that: (1) The applied SMOTEENN is more effective than SMOTE and ADASYN in dealing with the imbalance of banking data. (2) The F1 and AUC values of the model improved and optimized by XGBoost using genetic algorithm can reach 90% and 99%, respectively, which are optimal compared to other six machine learning models. The GA-XGBoost classifier was identified as the best solution for the customer churn problem. (3) Using Shapley values, we explain how each feature affects the model results, and analyze the features that have a high impact on the model prediction, such as the total number of transactions in the past year, the amount of transactions in the past year, the number of products owned by customers, and the total sales balance. 
The contribution of this paper is mainly in two aspects: (1) this study can provide useful information from the black box model based on the accurate identification of churned customers, which can provide reference for commercial banks to improve their service quality and retain customers; (2) it can provide reference for customer churn early warning models of other related industries, which can help the banking industry to maintain customer stability, maintain market position and reduce corporate losses.

  7. Student Engagement

    • kaggle.com
    Updated Nov 23, 2022
    Cite
    The Devastator (2022). Student Engagement [Dataset]. https://www.kaggle.com/datasets/thedevastator/student-engagement-with-tableau-a-data-science-p
    Dataset updated
    Nov 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Student Engagement

    Predicting Engagement and Exam Performance

    By [source]

    About this dataset

    This dataset contains information on student engagement with Tableau, including quizzes, exams, and lessons. The data includes the course title, the rating of the course, the date the course was rated, the exam category, the exam duration, whether the answer was correct or not, the number of quizzes completed, the number of exams completed, the number of lessons completed, the date engaged, the exam result, and more

    How to use the dataset

    The 'Student Engagement with Tableau' dataset offers insights into student engagement with the Tableau software. The data includes information on courses, exams, quizzes, and student learning.

    This dataset can be used to examine how students use Tableau, what kind of engagement leads to better learning outcomes, and whether certain course or exam characteristics are associated with student engagement

    Research Ideas

    • Creating a heat map of student engagement by course and location
    • Determining which courses are most popular among students from different countries
    • Identifying patterns in students' exam results

    Acknowledgements

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: 365_course_info.csv
    • course_title: The title of the course. (String)

    File: 365_course_ratings.csv
    • course_rating: The rating given to the course by the student. (Numeric)
    • date_rated: The date on which the course was rated. (Date)

    File: 365_exam_info.csv
    • exam_category: The category of the exam. (Categorical)
    • exam_duration: The duration of the exam in minutes. (Numerical)

    File: 365_quiz_info.csv
    • answer_correct: Whether or not the student answered the question correctly. (Boolean)

    File: 365_student_engagement.csv
    • engagement_quizzes: The number of times a student has engaged with quizzes. (Numeric)
    • engagement_exams: The number of times a student has engaged with exams. (Numeric)
    • engagement_lessons: The number of times a student has engaged with lessons. (Numeric)
    • date_engaged: The date of the student's engagement. (Date)

    File: 365_student_exams.csv
    • exam_result: The result of the exam. (Categorical)
    • exam_completion_time: The time it took to complete the exam. (Numerical)
    • date_exam_completed: The date the exam was completed. (Date)

    File: 365_student_hub_questions.csv
    • date_question_asked: The date the question was asked. (Date)

    File: 365_student_info.csv
    • student_country: The country of the student. (Categorical)
    • date_registered: The date the student registered for the course. (Date)

    File: 365_student_learning.csv | Column name | Description | |:--------------------|:------------------------------...

  8. Data from: Back Order Prediction Dataset

    • kaggle.com
    Updated Nov 7, 2021
    + more versions
    Cite
    Gowtham Miryala (2021). Back Order Prediction Dataset [Dataset]. https://www.kaggle.com/gowthammiryala/back-order-prediction-dataset/code
    Dataset updated
    Nov 7, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gowtham Miryala
    Description

    Problem Statement

    In a supply chain system, material backorder is a common problem that affects an inventory system's service level and effectiveness. Identifying the parts most likely to run short before the shortage occurs presents a significant opportunity to improve a company's overall performance. In this project, we will train classifiers to predict future back-ordered products and generate predictions for a test set.

    File descriptions

    Here we have two CSV files (Training_BOP.csv and Testing_BOP.csv).
    • Training_BOP.csv - the training set
    • Testing_BOP.csv - the testing set

    Each file has 23 columns and the last column (went_on_backorder) is the target column.

    Data fields

    • sku - sku code
    • national_inv - Current inventory level of component
    • lead_time - Transit time
    • in_transit_qtry - Quantity in transit
    • forecast_x_month - Forecast sales for the next 3, 6, 9 months
    • sales_x_month - Sales quantity for the prior 1, 3, 6, 9 months
    • min_bank - Minimum recommended amount in stock
    • potential_issue - Indicator variable noting potential issue with item
    • pieces_past_due - Parts overdue from source
    • perf_x_months_avg - Source performance in the last 6 and 12 months
    • local_bo_qty - Amount of stock orders overdue
    • X17-X22 - General Risk Flags
    • went_on_back_order - Product went on backorder
    • Validation - Indicator variable for training (0), validation (1), and test set (2)

  9. Details of feature variables of the data set.

    • plos.figshare.com
    xls
    Updated Dec 8, 2023
    Cite
    Ke Peng; Yan Peng; Wenguang Li (2023). Details of feature variables of the data set. [Dataset]. http://doi.org/10.1371/journal.pone.0289724.t002
    Available download formats: xls
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ke Peng; Yan Peng; Wenguang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, with the continuous improvement of the financial system and the rapid development of the banking industry, the competition of the banking industry itself has intensified. At the same time, with the rapid development of information technology and Internet technology, customers’ choice of financial products is becoming more and more diversified, and customers’ dependence and loyalty to banking institutions is becoming less and less, and the problem of customer churn in commercial banks is becoming more and more prominent. How to predict customer behavior and retain existing customers has become a major challenge for banks to solve. Therefore, this study takes a bank’s business data on Kaggle platform as the research object, uses multiple sampling methods to compare the data for balancing, constructs a bank customer churn prediction model for churn identification by GA-XGBoost, and conducts interpretability analysis on the GA-XGBoost model to provide decision support and suggestions for the banking industry to prevent customer churn. The results show that: (1) The applied SMOTEENN is more effective than SMOTE and ADASYN in dealing with the imbalance of banking data. (2) The F1 and AUC values of the model improved and optimized by XGBoost using genetic algorithm can reach 90% and 99%, respectively, which are optimal compared to other six machine learning models. The GA-XGBoost classifier was identified as the best solution for the customer churn problem. (3) Using Shapley values, we explain how each feature affects the model results, and analyze the features that have a high impact on the model prediction, such as the total number of transactions in the past year, the amount of transactions in the past year, the number of products owned by customers, and the total sales balance. 
The contribution of this paper is mainly in two aspects: (1) this study can provide useful information from the black box model based on the accurate identification of churned customers, which can provide reference for commercial banks to improve their service quality and retain customers; (2) it can provide reference for customer churn early warning models of other related industries, which can help the banking industry to maintain customer stability, maintain market position and reduce corporate losses.

  10. ‘Titanic Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Titanic Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-titanic-dataset-bec7/bfa18318/?iid=006-936&v=presentation
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Titanic Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yasserh/titanic-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Description:

    The sinking of the Titanic is one of the most infamous shipwrecks in history.

    On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

    While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

    In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

    Acknowledgements:

    This dataset is sourced from Kaggle: https://www.kaggle.com/c/titanic/data

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build a strong classification model to predict whether the passenger survives or not.
    • Also fine-tune the hyperparameters & compare the evaluation metrics of various classification algorithms.

    --- Original source retains full ownership of the source dataset ---

  11. Perfect Score House Prices

    • kaggle.com
    Updated May 26, 2020
    Cite
    Efrad Galio (2020). Perfect Score House Prices [Dataset]. https://www.kaggle.com/datasets/efradgamer/perfect-score-house-prices/data
    Dataset updated
    May 26, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Efrad Galio
    Description

    This file contains the actual target values for the test set of the House Prices problem. Do not submit it to the leaderboard, because doing so would distort the rankings. Use it to check your predictions locally and save time, rather than submitting again and again.

  12. StackSample: 10% of Stack Overflow Q&A

    • kaggle.com
    Updated Oct 8, 2019
    + more versions
    Cite
    Stack Overflow (2019). StackSample: 10% of Stack Overflow Q&A [Dataset]. https://www.kaggle.com/stackoverflow/stacksample/data
    Dataset updated
    Oct 8, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Stack Overflow
    Description

    Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website.

    This is organized as three tables:

    • Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.
    • Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
    • Tags contains the tags on each of these questions
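
The link between the Answers and Questions tables described above can be sketched as a simple grouping on `ParentId`. The rows below are hypothetical, and the `Title` and `Score` field names are assumed for illustration; only `Id` and `ParentId` are named in the description.

```python
from collections import defaultdict

# Hypothetical rows; question Ids are multiples of 10, as in the sample.
questions = [
    {"Id": 80, "Title": "How do I reverse a list?"},
    {"Id": 90, "Title": "What is a closure?"},
]
answers = [
    {"Id": 92, "ParentId": 80, "Score": 5},
    {"Id": 124, "ParentId": 80, "Score": 1},
    {"Id": 200, "ParentId": 90, "Score": 3},
]


def group_answers_by_question(answer_rows):
    """Map each question Id to the list of answers whose ParentId points at it."""
    grouped = defaultdict(list)
    for answer in answer_rows:
        grouped[answer["ParentId"]].append(answer)
    return grouped


grouped = group_answers_by_question(answers)
answer_counts = {q["Id"]: len(grouped[q["Id"]]) for q in questions}
print(answer_counts)
```

The same grouping gives a starting point for the example projects, such as relating question text to the number or score of its answers.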

    Datasets of all R questions and all Python questions are also available on Kaggle, but this dataset is especially useful for analyses that span many languages.

    Example projects include:

    • Identifying tags from question text
    • Predicting whether questions will be upvoted, downvoted, or closed based on their text
    • Predicting how long questions will take to answer

    License

    All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required.

  13. Question-Answer combination

    • kaggle.com
    Updated Jan 9, 2020
    Cite
    Ailurophile (2020). Question-Answer combination [Dataset]. https://www.kaggle.com/veeralakrishna/questionanswer-combination/metadata
    Dataset updated
    Jan 9, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ailurophile
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Valuelabs ML Hackathon

    About Problem Statement: Genre: NLP - Problem Type: Contextual Semantic Similarity, Auto-generate Text-based answers

    Submission Format:
    • You need to generate up to 3 distractors for each Question-Answer combination
    • Each distractor is a string
    • The 3 distractors/strings need to be separated with a comma (,)
    • Each value in the distractor column of Results.csv will contain the distractors as follows: distractor_for_QnA_1 = "distractor1","distractor2","distractor3"

    About the Evaluation Parameter:
    • All distractor values for 1 question-answer will be converted into a vector form
    • 1 vector gets generated for submitted distractors and 1 vector is generated for truth value
    • cosine_similarity between these 2 vectors is evaluated
    • Similarly, cosine_similarity gets evaluated for all the question-answer combinations
    • Score of your submitted prediction file = mean(cosine_similarity between distractor vectors for each entry in test.csv)
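
The scoring rule above can be approximated as follows. The description does not say how distractors are turned into vectors, so this sketch assumes a plain bag-of-words count vector; the hackathon's actual vectorizer may differ. The function names and sample strings are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Assumed vectorization: lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_score(submitted, truth):
    """Mean cosine similarity over all question-answer entries."""
    similarities = [
        cosine_similarity(bag_of_words(s), bag_of_words(t))
        for s, t in zip(submitted, truth)
    ]
    return sum(similarities) / len(similarities)


# Hypothetical distractor strings for two question-answer entries.
submitted = ["red apple", "cold water"]
truth = ["green apple", "cold water"]
print(mean_score(submitted, truth))
```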

    Common Issues Faced: How to handle them?:

    Download Dataset giving XML error: Try restarting your session after clearing browser cache/cookies and try again. If you still face any issue, please raise a ticket with us.

    Upload Prediction File not working: Ensure you are compliant with the Guidelines and FAQs. You will face this error if you exceed the maximum number of prediction file uploads allowed.

    Exceptions (Incorrect number of Rows / Incorrect Headers / Prediction missing for a key): For this problem statement, we recommend you to update the 'distractor' column in Results.csv with your predictions, following the format explained above.

    Evaluation is getting stuck in a loop: We recommend you to immediately refresh your session and start afresh with a cleared cache. Please ensure your predictions.csv matches the file format Results.csv. Please check that all the above mentioned checks have been conducted. If you still face any issue, please raise a ticket with us.

  14. Sepsis Survival Prediction

    • kaggle.com
    Updated Oct 19, 2023
    Cite
    Joakim Arvidsson (2023). Sepsis Survival Prediction [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/sepsis-survival-minimal-clinical-records
    Dataset updated
    Oct 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Joakim Arvidsson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of 110,204 admissions of 84,811 hospitalized subjects between 2011 and 2012 in Norway who were diagnosed with infections, systemic inflammatory response syndrome, sepsis by causative microbes, or septic shock. The prediction task is to determine whether a patient survived or is deceased at a time of about 9 days after collecting their medical record at the hospital. This is an important prediction problem in clinical medicine. Sepsis is a life-threatening condition triggered by an immune overreaction to infection, leading to organ failure or even death. Sepsis is associated with immediate death risk, often killing patients within one hour. This renders many laboratory tests and hospital analyses impractical for timely diagnosis and treatment. Being able to predict the survival of patients within minutes with as few and easy-to-retrieve medical features as possible is very important.

    What do the instances in this dataset represent?

    For the primary cohort, they represent records of patients affected by sepsis potential preconditions (ante Sepsis-3 definition); for the study cohort, they represent only the patients’ admissions defined by the novel Sepsis-3 definition.

    Are there recommended data splits?

    There is no recommended split; a standard train-test split can be used. A three-way holdout split (i.e., training, validation/development, testing) can be used when doing model selection.
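
A three-way holdout split of the kind mentioned above might look like the sketch below. The 70/15/15 proportions, the fixed seed, and the function name are assumptions chosen for illustration, not recommendations from the dataset authors.

```python
import random


def three_way_split(records, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then carve off test and validation sets; the rest is training."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical record identifiers standing in for patient admissions.
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Since the primary cohort has more admissions than subjects, splitting by subject rather than by admission may be worth considering to keep one patient's records out of both training and test sets.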

    Does the dataset contain data that might be considered sensitive in any way?

    Yes. It contains information about the gender and age of the patient.

    Was there any data preprocessing performed?

    All the categorical variables have been encoded (so no preprocessing is necessary).

    Additional Information

    Primary cohort from Norway:
    • 4 features for 110,204 patient admissions
    • file: 's41598-020-73558-3_sepsis_survival_primary_cohort.csv'

    Study cohort (a subset of the primary cohort) from Norway:
    • 4 features for 19,051 patient admissions
    • file: 's41598-020-73558-3_sepsis_survival_study_cohort.csv'

    Validation cohort from South Korea:
    • 4 features for 137 patients
    • file: 's41598-020-73558-3_sepsis_survival_validation_cohort.csv'

    The validation cohort from South Korea was used by Chicco and Jurman (2020) as an external validation cohort to confirm the generalizability of their proposed approach.

  15. GitHub Bugs Prediction Challenge (Machine Hack)

    • kaggle.com
    Updated Oct 8, 2020
    Cite
    ask9 (2020). GitHub Bugs Prediction Challenge (Machine Hack) [Dataset]. https://www.kaggle.com/arbazkhan971/github-bugs-prediction-challenge-machine-hack/discussion
    Dataset updated
    Oct 8, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ask9
    Description

    Overview

    Foreseeing bugs, features, and questions on GitHub can be fun, especially when one is provided with a colossal dataset containing the GitHub issues. In this hackathon, we are challenging the MachineHack community to come up with an algorithm that can predict the bugs, features, and questions based on GitHub titles and the text body. With text data, there can be a lot of challenges, especially when the dataset is big. Analyzing such a dataset requires a lot to be taken into account, mainly due to the preprocessing involved to represent raw text and make it machine-understandable. Usually, we stem and lemmatize the raw information and then represent it using TF-IDF, Word Embeddings, etc.

    However, provided the state-of-the-art NLP models such as Transformer based BERT models one can skip the manual feature engineering like TF-IDF and Count Vectorizers. In this short span of time, we would encourage you to leverage the ImageNet moment (Transfer Learning) in NLP using various pre-trained models.

    In this hackathon, we also have an interesting learning curve for all the machine learning specialists to write some quality code to win the prizes, as the evaluation involves getting a code quality score using the Embold Code Analysis platform here.

    Every participant has to register on the Embold's platform for free as a mandatory step before proceeding with the hackathon

    Here is a quick tour of how to use the Embold's Code Analysis Platform for FREE !!

    Dataset Description:
    • Train.json - 150000 rows x 3 columns (includes the label column as the target variable)
    • Test.json - 30000 rows x 2 columns
    • Train_extra.json - 300000 rows x 3 columns (includes the label column as the target variable); provided solely for training purposes, and can be appended to Train.json when training the model
    • Sample Submission.csv - please check the Evaluation section for more details on how to generate a valid submission

    Attribute Description:
    • Title - the title of the GitHub bug, feature, or question
    • Body - the body of the GitHub bug, feature, or question
    • Label - the class of the issue: Bug - 0, Feature - 1, Question - 2

    Skills:
    • Natural Language Processing
    • Feature extraction from raw text using TF-IDF, CountVectorizer
    • Using Word Embedding to represent words as vectors
    • Using Pretrained models like Transformers, BERT
    • Optimizing accuracy score as a metric to generalize well on unseen data
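To make the TF-IDF representation mentioned above concrete, here is a minimal standard-library sketch. It assumes raw term counts and an unsmoothed idf of log(N / df); real libraries usually add smoothing and normalisation. The function names and the three issue texts are hypothetical and are not taken from the dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Assumption: weight = raw term count * log(N / document frequency).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Hypothetical issue texts (title + body), for illustration only.
issues = [
    "App crashes on startup with null pointer",
    "Add dark mode support to the settings page",
    "How do I configure the settings file?",
]
vectors = tfidf(issues)
```

Terms that occur in a single issue (such as "crashes") receive a higher weight than terms shared across issues (such as "settings"), which is what makes the representation useful as classifier input.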

  16. Predict Loan Approval Problem

    • kaggle.com
    Updated Nov 11, 2021
    Cite
    AbdelrahmanTarek22 (2021). Predict Loan Approval Problem [Dataset]. https://www.kaggle.com/abdelrahmantarek22/predict-loan-approval-problem/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AbdelrahmanTarek22
    Description

    Dataset

    This dataset was created by AbdelrahmanTarek22

    Contents

  17. Question Classification: Android or iOS?

    • kaggle.com
    zip
    Updated Oct 29, 2020
    Cite
    xhlulu (2020). Question Classification: Android or iOS? [Dataset]. https://www.kaggle.com/xhlulu/question-classification-android-or-ios
    Explore at:
    Available download formats: zip (19598168 bytes)
    Dataset updated
    Oct 29, 2020
    Authors
    xhlulu
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Imagine you have to process bug reports about an application your company is developing, which is available for both Android and iOS. Could you find a way to automatically classify them so you can send them to the right support team?

    Content

    The dataset contains data from two StackExchange forums: Android Enthusiasts and Ask Differently (Apple). I pre-processed both datasets from the raw XML files retrieved from Internet Archive in order to only contain useful information for building Machine Learning classifiers. In the case of the Apple forum, I narrowed down to the subset of questions that have one of the following tags: "iOS", "iPhone", "iPad".

    Think of this as a fun way to learn to build ML classifiers! The training, validation and test sets are all available, but in order to build robust models please try to use the test set as little as possible (only as a last validation for your models).

    Acknowledgements

    The image was retrieved from unsplash and made by @thenewmalcolm. Link to image here.

    The data was made available for free under a CC-BY-SA 4.0 license by StackExchange and hosted by Internet Archive. Find it here.

  18. Loan_Prediction (Numerical_values)

    • kaggle.com
    zip
    Updated Jan 12, 2021
    Cite
    Shaimaa Othman (2021). Loan_Prediction (Numerical_values) [Dataset]. https://www.kaggle.com/shaimaa1234/loan-prediction-numerical-values
    Explore at:
    Available download formats: zip (39514 bytes)
    Dataset updated
    Jan 12, 2021
    Authors
    Shaimaa Othman
    Description

    Dataset

    This dataset was created by Shaimaa Othman

    Contents

  19. Titanic Dataset

    • kaggle.com
    Updated Dec 24, 2021
    Cite
    M Yasser H (2021). Titanic Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/titanic-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 24, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    The sinking of the Titanic is one of the most infamous shipwrecks in history.

    On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

    While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

    In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

    Acknowledgements:

    This dataset is taken from Kaggle: https://www.kaggle.com/c/titanic/data.

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build a strong classification model to predict whether the passenger survives or not.
    • Also fine-tune the hyperparameters & compare the evaluation metrics of various classification algorithms.
  20. Customer Churn - Decision Tree & Random Forest

    • kaggle.com
    Updated Jul 6, 2023
    Cite
    vikram amin (2023). Customer Churn - Decision Tree & Random Forest [Dataset]. https://www.kaggle.com/datasets/vikramamin/customer-churn-decision-tree-and-random-forest
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2023
    Dataset provided by
    Kaggle
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Main objective: Identify which customers will churn and which will not.
    • Methodology: This is a classification problem. We use a decision tree and a random forest to predict the outcome.
    • Steps involved:

    1. Read the data.
    2. Check the data types.
    3. Convert character vectors to factors, as this is a classification problem.
    4. Drop the variable that is not significant for the analysis: "customerID".
    5. Check for missing values. None are found.
    6. Split the data into train and test sets, so the train data can be used to build the model and the test data for prediction. We use an 80-20 ratio (train/test) with the sample function.
    7. Install and load the libraries (rpart, rpart.plot, rattle, RColorBrewer, caret).
    8. Run the decision tree using the rpart function. The dependent variable is Churn, and the other 19 variables are independent variables.
    9. Plot the decision tree.

    Average customer churn is 27%. Churn can take place if the tenure is >= 7.5 and there is no internet service.
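The 80-20 train/test split described in the steps above can be sketched as follows. The write-up uses R's sample function; this is only a Python analogue under assumed names, and the 100 generated records are hypothetical stand-ins for the customer table.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split the rows into train and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Hypothetical records, for illustration only.
rows = [{"id": i, "Churn": "Yes" if i % 4 == 0 else "No"} for i in range(100)]
train, test = train_test_split(rows)
```

Fixing the seed keeps the split stable between runs, so accuracy figures from different models are compared on the same test rows.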

    10. Tune the model.
    11. Define the search grid using the expand.grid function.
    12. Set up the control parameters with 5-fold cross-validation.
    13. Print the model: the best CP is 0.01, with an accuracy of 79.00%.
    14. Predict with the model.
    15. Find out which variables are most and least significant.

    The most significant variables are Internet Service and Tenure; the least significant are Streaming Movies and Tech Support.
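The 5-fold cross-validation used for tuning above partitions the training rows into five folds and holds each one out in turn. The sketch below shows only that index bookkeeping; the function name and the row count of 23 are assumptions for illustration, and the write-up itself relies on caret's control parameters in R.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_rows=23, k=5))
```

Each candidate value in the search grid would be scored on all five validation folds, and the value with the best average score kept.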

    USE RANDOM FOREST

    1. Run library(randomForest). Here we use the default ntree (500) and the default mtry, which for classification is the floor of sqrt(p), where p is the number of independent variables (4 here).

      The confusion matrix gives an accuracy of 79.27%, marginally higher than the decision tree's 79.00%. The error rate is pretty low when predicting "No" and much higher when predicting "Yes".

    2. Plot the model to show which variables reduce the Gini impurity the most and the least. Total charges and tenure reduce the Gini impurity the most, while phone service has the least impact.
    3. Predict with the model and create a new data frame showing the actual vs. predicted values.
    4. Plot the model to find where the OOB (out-of-bag) error stops decreasing or becomes constant. The error stops decreasing between 100 and 200 trees, so we take ntree = 200 when tuning the model.
    5. Tune the model. mtry = 2 has the lowest OOB error rate.
    6. Use random forest with mtry = 2 and ntree = 200.

    The confusion matrix gives an accuracy of 79.71%, marginally higher than that of the default model (ntree = 500, mtry = 4), i.e. 79.27%, and of the decision tree, i.e. 79.00%. The error rate is pretty low when predicting "No" and m...
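As a rough illustration of how an accuracy figure and the per-class error rates are read off a confusion matrix, consider the sketch below. The labels and predictions are hypothetical and do not reproduce the reported results; the error rates are computed per actual class, which is an assumption about how the write-up measures them.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy_and_class_errors(actual, predicted, positive="Yes"):
    """Return accuracy plus the error rate within each actual class.

    Assumes both classes occur at least once in `actual`.
    """
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    error_negative = fp / (tn + fp)  # actual "No" rows predicted as "Yes"
    error_positive = fn / (tp + fn)  # actual "Yes" rows predicted as "No"
    return accuracy, error_negative, error_positive


# Hypothetical labels: 8 non-churners and 4 churners.
actual = ["No"] * 8 + ["Yes"] * 4
predicted = ["No"] * 7 + ["Yes"] + ["Yes"] * 2 + ["No"] * 2
accuracy, error_no, error_yes = accuracy_and_class_errors(actual, predicted)
```

With imbalanced classes, as in this toy example, a model can reach a reasonable overall accuracy while still misclassifying a large share of the minority "Yes" class.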

Cite
M Yasser H (2022). Student Marks Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/student-marks-dataset/code

Student Marks Dataset

Student Marks Prediction - Regression Problem

Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 4, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
M Yasser H
License

https://creativecommons.org/publicdomain/zero/1.0/

Description


Description:

The data consists of students' marks along with their study time and number of courses. The dataset is downloaded from the UCI Machine Learning Repository.

Properties of the Dataset:
Number of Instances: 100
Number of Attributes: 3 including the target variable.

The project is simple yet challenging, as it has very limited features and samples. Can you build a regression model that captures the patterns in the dataset while maintaining the generalisability of the model?

Objective:

  • Understand the Dataset & cleanup (if required).
  • Build Regression models to predict the student marks wrt multiple features.
  • Also evaluate the models & compare their respective scores like R2, RMSE, etc.
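The R2 and RMSE scores named in the objectives can be computed as in the following minimal sketch. The function names and the four actual/predicted marks are hypothetical and are not drawn from the dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical marks and model predictions, for illustration only.
actual_marks = [10, 20, 30, 40]
predicted_marks = [12, 18, 33, 37]
model_rmse = rmse(actual_marks, predicted_marks)
model_r2 = r2_score(actual_marks, predicted_marks)
```

RMSE is expressed in the same units as the marks, while R2 is unitless, so reporting both gives a fuller picture when comparing regression models.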