This dataset was created by chinthakindi vinod
This dataset was created by Jahanvee Narang
**New to machine learning and data science? No question is too basic or too simple. Use this place to post any first-timer clarifying questions about the classification algorithm or the datasets.** This file contains customer demographics and whether each customer clicked the ad or not. Use this file with a classification algorithm to predict ad clicks, with the customer demographics as the independent variables.
This data set contains the following features:
This dataset was created by Kushal Dev
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
🇬🇧 English:
This synthetic dataset is designed for practicing fake news detection using natural language processing (NLP) techniques. It contains 1000 news samples labeled as "real" or "fake", including fabricated headlines and articles that mimic real-world patterns.
You can use this dataset to:
- Train NLP classification models such as Logistic Regression, SVM, and BERT
- Perform feature engineering on textual data
- Practice binary classification problems in news analytics
Columns:
- title: News headline
- text: Main body of the news
- label: Label indicating whether the news is fake or real
🇹🇷 Turkish:
This synthetic dataset is designed for researchers and students who want to practice fake news detection using natural language processing (NLP) techniques. It contains 1000 sample news items, each labeled as "real" or "fake". The headlines and article bodies were generated to mimic the real world.
With this dataset:
- NLP models such as Logistic Regression, SVM, and BERT can be trained
- Feature engineering can be performed on text
- Classification studies on fake news detection can be carried out
Variables:
- title: News headline
- text: News content
- label: Label (fake/real)
Original Data Source: Fake News Detection
This dataset was created by Karthikeyan Raghav
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
🇬🇧 English:
This dataset contains 1,000 synthetic email messages labeled as either spam or ham. It was created to help users build and evaluate text classification models using basic natural language processing (NLP) techniques.
Use this dataset to:
- Train a spam filter using Naive Bayes, SVM, or Logistic Regression
- Practice text cleaning, tokenization, and TF-IDF vectorization
- Build email classification models without needing real personal email data
🇹🇷 Turkish:
This dataset contains 1,000 synthetic email messages, each labeled as spam or ham (normal). It was prepared for those who want to develop a spam detection model using natural language processing techniques.
With this dataset:
- Text classification models such as Naive Bayes and SVM can be developed
- Text cleaning, tokenization, and TF-IDF can be practiced
- NLP can be practiced without needing real emails
Original Data Source: Spam Mail Classifier Dataset
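A TF-IDF plus Naive Bayes spam filter of the kind this description mentions could be sketched as follows. This is a minimal illustration, assuming scikit-learn is available; the six messages are invented for the example and are not rows from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages invented for illustration; they are NOT rows from the dataset.
messages = [
    "Win a free prize now, click here",
    "Limited offer, claim your free reward",
    "Free cash prize, click to claim now",
    "Meeting moved to 3pm tomorrow",
    "Please review the attached report",
    "Lunch tomorrow with the project team?",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# TfidfVectorizer lowercases and tokenizes the text, then weights each term;
# MultinomialNB classifies the resulting sparse vectors.
spam_filter = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
spam_filter.fit(messages, labels)

prediction = spam_filter.predict(["Claim your free prize now"])[0]
print(prediction)
```

With the real data, the two lists would be replaced by the message and label columns of the file, and the model should be scored on a held-out split rather than on a single message.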
https://creativecommons.org/publicdomain/zero/1.0/
Iris Flower Classification is a classic machine learning task used for learning and practicing classification algorithms. The dataset contains features like sepal length, sepal width, petal length, and petal width for three different species of iris flowers. This project involves data pre-processing, model selection, and evaluation. Here, we use classification algorithms like logistic regression, decision trees, k-nearest neighbors (KNN), or support vector machines (SVM) for this classification task.
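The workflow described above (pre-processing, model selection, evaluation) can be sketched with the copy of the Iris data that ships with scikit-learn. This is one possible setup, not the project's own code: the choice of KNN, the 75/25 split, and the random seed are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Four features per flower: sepal length, sepal width, petal length, petal width.
X, y = load_iris(return_X_y=True)

# Hold out a stratified 25% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Pre-processing (standardisation) and the classifier live in one pipeline,
# so the scaler is fitted on the training split only.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping `KNeighborsClassifier` for `LogisticRegression`, `DecisionTreeClassifier`, or `SVC` leaves the rest of the sketch unchanged.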
https://creativecommons.org/publicdomain/zero/1.0/
This is an American Sign Language Digits Dataset covering signs 0 to 9. It uses depth information to generate hand key-points (using MediaPipe), which enriches the dataset and improves classification accuracy.
This is an American Sign Language Digits Dataset built with the MediaPipe framework, which accurately detects the hand and 21 hand key-points from a raw RGB image and stores the coordinate values of these key-points. The dataset contains 5000 raw image files for signs 0 to 9 (500 files per sign) and 5000 corresponding output image files (produced by applying MediaPipe). After generating the dataset, we also performed classification using different classifiers, such as KNN, SVM, RFC, DTC, and neural networks. The accuracies of the different classifiers are reported in the classification code (in the code section).
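As a rough sketch of how stored key-point coordinates can feed a classifier such as KNN, the example below flattens each hand into one feature vector. It uses synthetic key-points rather than the dataset files, and the layout (21 key-points with x, y, z values each) is an assumption; the actual file format may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# 10 digit classes and 21 key-points come from the description;
# 3 coordinates (x, y, z) per key-point is an assumption.
n_classes, per_class, n_points, n_coords = 10, 40, 21, 3

# Synthetic stand-in data: one random "template" hand per digit plus small noise.
templates = rng.uniform(0.0, 1.0, size=(n_classes, n_points, n_coords))
keypoints = np.repeat(templates, per_class, axis=0)
keypoints += rng.normal(scale=0.02, size=keypoints.shape)
labels = np.repeat(np.arange(n_classes), per_class)

# Flatten each hand (21 x 3) into a single 63-value feature vector.
features = keypoints.reshape(len(keypoints), -1)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"Accuracy on synthetic key-points: {accuracy:.2f}")
```

The score here only reflects how separable the synthetic templates are; it says nothing about the accuracies reported for the real dataset.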
A New 2D Static Hand Gesture Colour Image Dataset for ASL Gestures - A.L.C. Barczak, N.H. Reyes, M. Abastillas, A. Piccio and T. Susnjak
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Dementia Prediction Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/shashwatwork/dementia-prediction-dataset on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Dementia is a syndrome – usually of a chronic or progressive nature – in which there is deterioration in cognitive function (i.e. the ability to process thought) beyond what might be expected from normal aging. It affects memory, thinking, orientation, comprehension, calculation, learning capacity, language, and judgment. Consciousness is not affected. The impairment in cognitive function is commonly accompanied, and occasionally preceded, by deterioration in emotional control, social behaviour, or motivation.
Dementia results from a variety of diseases and injuries that primarily or secondarily affect the brain, such as Alzheimer's disease or stroke.
Dementia is one of the major causes of disability and dependency among older people worldwide. It can be overwhelming, not only for the people who have it, but also for their carers and families. There is often a lack of awareness and understanding of dementia, resulting in stigmatization and barriers to diagnosis and care. The impact of dementia on carers, family, and society at large can be physical, psychological, social, and economic.
This set consists of a longitudinal collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year, for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit.
Battineni, Gopi; Amenta, Francesco; Chintalapudi, Nalini (2019), “Data for: MACHINE LEARNING IN MEDICINE: CLASSIFICATION AND PREDICTION OF DEMENTIA BY SUPPORT VECTOR MACHINES (SVM)”, Mendeley Data, V1, doi: 10.17632/tsy6rbc5d4.1 * Dataset is available here.
--- Original source retains full ownership of the source dataset ---
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset contains complete data on customer reviews for different banking companies. The data is not clean, so you will need to do exploratory data analysis before using it with more complex models. If you are using simpler models, you can simply take the columns with the star rating and the feedback. (You can see my example code with this dataset.) Good luck 💯!
This dataset was created by Sankar
https://creativecommons.org/publicdomain/zero/1.0/
These are the South African Lottery results from 2000, when it started, to 2015. I was interested in predicting whether there will be winners or not, given the following publicly available information prior to betting:
The above-mentioned features attract quite a lot of consumers, and an increase in the number of bettors increases the chance that someone wins.
This classifier achieves a 98% score when predicting, against the X_test set, whether there will be a division 1 jackpot winner or not. Winner is 1 and no winner is 0.
The score is only 98% because the classifier cannot predict the cases where there are 2 winners in division 1, so its predictions do not fully match the test set.
The data was acquired from the National Lottery website. Please look at: https://www.nationallottery.co.za/lotto-history/?game=Lotto for further information
I am new to machine learning. Being a chemical engineer by vocation, I came across this sphere of knowledge, and I must admit that most of my nights are spent coding away and trying to predict the most ludicrous datasets I can dream up. It's all been a lot of fun, and I learn a lot more with every exercise.
One of my challenges is visualising this data. I tried meshgrid and contourf plots, but I am getting errors. Also, is it possible to predict the number of division 1 winners? In the y_train data, there are a number of instances with more than one division 1 winner. However, the SVM was built only to predict 0 for no winners or 1 for winners.
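On the question of predicting the number of winners rather than just 0 or 1: scikit-learn's `SVC` accepts more than two classes directly, so the winner count itself can serve as the target. The sketch below shows only the mechanics on synthetic data. The two features and the clean separation between classes are invented for illustration and say nothing about how predictable real lottery outcomes are.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Synthetic stand-in data: two made-up numeric features per draw and a
# winner count (0, 1 or 2) as the target. NOT the lottery data.
n_draws = 300
winner_counts = rng.integers(0, 3, size=n_draws)
features = rng.normal(size=(n_draws, 2)) + winner_counts[:, None] * 3.0

X_train, X_test, y_train, y_test = train_test_split(
    features, winner_counts, test_size=0.25, random_state=0
)

# SVC handles multi-class targets itself (one-vs-one classifiers internally),
# so y can hold the winner count instead of a 0/1 flag.
clf = SVC(kernel="rbf").fit(X_train, y_train)
predicted_counts = clf.predict(X_test)
test_accuracy = clf.score(X_test, y_test)

print(sorted(set(predicted_counts.tolist())))
print(f"Accuracy: {test_accuracy:.2f}")
```

If draws with two or more winners are rare in the real data, a `class_weight="balanced"` setting or a regression model on the count may be worth comparing.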
https://creativecommons.org/publicdomain/zero/1.0/
DATASET: https://www.muratkoklu.com/datasets/
Citation Request : KOKLU, M., SARIGIL, S., & OZBEK, O. (2021). The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.). Genetic Resources and Crop Evolution, 68(7), 2713-2726. Doi: https://doi.org/10.1007/s10722-021-01226-0
https://link.springer.com/article/10.1007/s10722-021-01226-0 https://link.springer.com/content/pdf/10.1007/s10722-021-01226-0.pdf
Abstract: Pumpkin seeds are frequently consumed as confection worldwide because of their adequate amount of protein, fat, carbohydrate, and mineral contents. This study was carried out on the two most important and quality types of pumpkin seeds, ‘‘Urgup_Sivrisi’’ and ‘‘Cercevelik’’, generally grown in Urgup and Karacaoren regions in Turkey. However, morphological measurements of 2500 pumpkin seeds of both varieties were made possible by using the gray and binary forms of threshold techniques. Considering morphological features, all the data were modeled with five different machine learning methods: Logistic Regression (LR), Multilayer Perceptrons (MLP), Support Vector Machine (SVM) and Random Forest (RF), and k-Nearest Neighbor (k-NN), which further determined the most successful method for classifying pumpkin seed varieties. However, the performances of the models were determined with the help of the 10 kfold cross-validation method. The accuracy rates of the classifiers were obtained as LR 87.92 percent, MLP 88.52 percent, SVM 88.64 percent, RF 87.56 percent, and k-NN 87.64 percent.
Keywords: Pumpkin seed, Logistic regression, Multilayer perceptrons, Random forest, Classification, Support vector machine, Thresholding
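The evaluation protocol named in the abstract, five classifiers compared with 10-fold cross-validation, could be sketched as below. This is not the authors' code: the data is synthetic (generated with `make_classification`), and the hyperparameters are scikit-learn defaults or arbitrary choices, so the printed accuracies are unrelated to the figures quoted in the abstract.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for morphological features
# (NOT the pumpkin seed measurements; sizes are arbitrary).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "k-NN": KNeighborsClassifier(),
}

mean_accuracy = {}
for name, model in models.items():
    # Scaling sits inside the pipeline so it is refitted on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f}")
```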
This dataset helps in identifying counterfeit banknotes based on statistical features extracted from genuine and forged currency notes. It contains attributes such as variance, skewness, and entropy, which are derived from images of banknotes using wavelet transformation.
Dataset Details:
This dataset is widely used for classification tasks and ML model evaluation in fraud detection.
http://opendatacommons.org/licenses/dbcl/1.0/
This is a Bangla Sign Language Dataset built with the MediaPipe framework, which accurately detects the hand and 21 hand key-points from a raw RGB image and stores the coordinate values of these key-points. After collecting 47000 raw image files for 47 signs (100 files per sign per user) and generating 47000 corresponding output image files by applying MediaPipe, the coordinate values of these key-points were stored in .csv files. This dataset contains 470 such .csv files (collected from 10 users for 47 signs in total). After generating the dataset, we also performed classification using different classifiers, such as KNN, SVM, RFC, DTC, and neural networks. The accuracies of the different classifiers are reported in the classification code (in the code section).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
📄 Description: This dataset consists of 1,156 images of four major chestnut (Castanea sativa) varieties cultivated in Turkey: Alandız, Aydın, Simav, and Zonguldak. Images were captured under controlled lighting conditions using a Samsung NX300 camera, from both front and back angles to ensure diversity. Each folder in the dataset corresponds to a specific chestnut variety.
The dataset has been used in multiple academic studies and is suitable for developing and testing image classification algorithms, deep learning models, and computer vision systems in agriculture and food technology.
📚 Citation Request: If you use this dataset in your research or application, cite the following studies:
Yurdakul, M., Uyar, K., & Taşdemir, Ş. Webserver-Based Mobile Application for Multi-class Chestnut (Castanea sativa) Classification Using Deep Features and Attention Mechanisms, Applied Fruit Science, 2025, 67:102. Springer DOI: https://doi.org/10.1007/s10341-025-01327-5
Yurdakul, M., Atabaş, İ., & Taşdemir, Ş. (2024, March). Chestnut (Castanea Sativa) Varieties Classification with Harris Hawks Optimization based Selected Features and SVM. In 2024 International Conference on Advances in Computing, Communication, Electrical, and Smart Systems (iCACCESS) (pp. 1-5). IEEE.
🧾 Folder Structure:
📁 alandız – 272 images
📁 aydın – 228 images
📁 simav – 304 images
📁 zonguldak – 352 images
All images are in .jpg format and represent single chestnuts from different angles.
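Because each folder corresponds to one variety, labels can be derived from the folder names. The sketch below indexes such a layout with the standard library only. The helper name `index_images` is hypothetical, and the demo runs on a temporary directory of empty placeholder files rather than on the real images.

```python
import tempfile
from collections import Counter
from pathlib import Path


def index_images(root):
    """Return (image_path, label) pairs, using each sub-folder name as the label."""
    root = Path(root)
    samples = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.glob("*.jpg")):
            samples.append((image_path, class_dir.name))
    return samples


# Demo on a throwaway directory with empty placeholder files (not real images);
# the per-folder counts here are arbitrary, not the dataset's counts.
with tempfile.TemporaryDirectory() as tmp:
    for variety, count in {"alandız": 3, "aydın": 2, "simav": 4, "zonguldak": 1}.items():
        folder = Path(tmp) / variety
        folder.mkdir()
        for i in range(count):
            (folder / f"img_{i}.jpg").touch()

    samples = index_images(tmp)
    counts = Counter(label for _, label in samples)

print(dict(counts))
```

Pointed at the real dataset root, the same counter offers a quick check against the per-folder image counts listed above.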
🧠 Potential Use Cases:
Image classification
Machine learning & deep learning model development
Feature selection and optimization benchmarking
Agricultural and food product recognition
http://www.gnu.org/licenses/fdl-1.3.html
This dataset provides valuable insights into the persistency of drug prescriptions in the pharmaceutical industry. By analyzing various factors, we aim to build a classification model to understand the factors influencing persistency. The dataset includes patient information, provider attributes, clinical factors, and disease/treatment factors. The challenge is to uncover patterns and relationships that impact persistency. This analysis will aid pharmaceutical companies in optimizing their strategies and improving patient outcomes. Join me in exploring this dataset and leveraging machine learning techniques to tackle this important problem. Let's dive in and unlock the secrets of drug persistency!
One of the challenges for all pharmaceutical companies is to understand the persistency of a drug as per the physician's prescription.
With an objective to gather insights on the factors that are impacting the persistency, build your own classification model.
Here I'm describing the columns in detail:
Variable: Variable Description
- Patient ID: Unique ID of each patient
- Persistency_Flag: Flag indicating if a patient was persistent or not
- Age: Age of the patient during their therapy
- Race: Race of the patient from the patient table
- Region: Region of the patient from the patient table
- Ethnicity: Ethnicity of the patient from the patient table
- Gender: Gender of the patient from the patient table
- IDN Indicator: Flag indicating patients mapped to IDN
- NTM - Physician Specialty: Specialty of the HCP that prescribed the NTM Rx
- NTM - T-Score: T-Score of the patient at the time of the NTM Rx (within 2 years prior to rxdate)
- Change in T Score: Change in T-Score before starting any therapy and after receiving therapy (Worsened, Remained Same, Improved, Unknown)
- NTM - Risk Segment: Risk Segment of the patient at the time of the NTM Rx (within 2 years prior to rxdate)
- Change in Risk Segment: Change in Risk Segment before starting any therapy and after receiving therapy (Worsened, Remained Same, Improved, Unknown)
- NTM - Multiple Risk Factors: Flag indicating if the patient falls under the multiple risk category (having more than 1 risk) at the time of the NTM Rx (within 365 days prior to rxdate)
- NTM - Dexa Scan Frequency: Number of DEXA scans taken prior to the first NTM Rx date (within 365 days prior to rxdate)
- NTM - Dexa Scan Recency: Flag indicating the presence of a Dexa Scan before the NTM Rx (within 2 years prior to rxdate or between their first Rx and Switched Rx, whichever is smaller and applicable)
- Dexa During Therapy: Flag indicating if the patient had a Dexa Scan during their first continuous therapy
- NTM - Fragility Fracture Recency: Flag indicating if the patient had a recent fragility fracture (within 365 days prior to rxdate)
- Fragility Fracture During Therapy: Flag indicating if the patient had a fragility fracture during their first continuous therapy
- NTM - Glucocorticoid Recency: Flag indicating usage of Glucocorticoids (>=7.5mg strength) in the one-year look-back from the first NTM Rx
- Glucocorticoid During Therapy: Flag indicating if the patient used Glucocorticoids during the first continuous therapy
- NTM - Injectable Experience: Flag indicating any injectable drug usage in the 12 months before the NTM OP Rx
- NTM - Risk Factors: Risk Factors that the patient falls into. For chronic Risk Factors a complete look-back is applied, and for non-chronic Risk Factors a one-year look-back from the date of the first OP Rx
- NTM - Comorbidity: Comorbidities are divided into two main categories, acute and chronic, based on the ICD codes. For chronic diseases a complete look-back from the first Rx date of NTM therapy is taken, and for acute diseases a one-year look-back before the NTM OP Rx is applied
- NTM - Concomitancy: Concomitant drugs recorded prior to starting a therapy (within 365 days prior to first rxdate)
- Adherence: Adherence for the therapies
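Most of the variables above are flags or categories, so a classification model needs them encoded first. The sketch below shows one common way to do this with a one-hot encoder inside a scikit-learn pipeline. The column names come from the list above, but the six rows and every category value (including "Persistent" / "Non-Persistent") are made up for illustration and may not match the values in the actual file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy rows for illustration only; the category values are
# assumptions, not values taken from the dataset.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female", "Male", "Female"],
    "Region": ["West", "South", "West", "Midwest", "South", "West"],
    "Dexa During Therapy": ["Y", "N", "Y", "N", "N", "Y"],
    "Persistency_Flag": ["Persistent", "Non-Persistent", "Persistent",
                         "Non-Persistent", "Non-Persistent", "Persistent"],
})

feature_columns = ["Gender", "Region", "Dexa During Therapy"]
X = df[feature_columns]
y = (df["Persistency_Flag"] == "Persistent").astype(int)  # 1 = persistent, 0 = not

model = Pipeline([
    # One-hot encode every categorical column; categories unseen during
    # fitting are ignored at prediction time instead of raising an error.
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), feature_columns),
    ])),
    ("classify", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

train_predictions = model.predict(X)
print(train_predictions.tolist())
```

With the real file, numeric columns would pass through a separate branch of the `ColumnTransformer`, and the model would be scored on a held-out split.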
This is my first dataset on Kaggle. I hope you will learn from it and make more notebooks with it. If you learn something from this dataset, don't forget to upvote it.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🇬🇧 English:
This synthetic dataset was created to simulate a typical bank credit approval process. It includes 1,000 applicant records with relevant financial and demographic details such as age, income, credit score, employment status, and requested loan amount. A final approved column indicates whether the credit application was accepted.
Use this dataset to:
🇹🇷 Turkish:
This synthetic dataset was created to model a bank's credit application process. It contains information on 1,000 applicants, such as age, income, credit score, employment status, and requested loan amount. The approved column indicates whether the application was approved.
With this dataset:
This dataset was created by chinthakindi vinod