7 datasets found

Classification Analysis Using Python
kaggle.com
Updated Jul 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nibedita Sahu (2023). Classification Analysis Using Python [Dataset]. https://www.kaggle.com/datasets/nibeditasahu/classification-analysis-using-python
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 3, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Nibedita Sahu
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The Iris dataset is a classic and widely used dataset in machine learning for classification tasks. It consists of measurements of different iris flowers, including sepal length, sepal width, petal length, and petal width, along with their corresponding species. With a total of 150 samples, the dataset is balanced and serves as an excellent choice for understanding and implementing classification algorithms. This notebook explores the dataset, preprocesses the data, builds a decision tree classification model, and evaluates its performance, showcasing the effectiveness of decision trees in solving classification problems.
Customer Churn - Decision Tree & Random Forest
kaggle.com
Updated Jul 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). Customer Churn - Decision Tree & Random Forest [Dataset]. https://www.kaggle.com/datasets/vikramamin/customer-churn-decision-tree-and-random-forest
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 6, 2023
Dataset provided by
Kaggle
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Main objective: Find out customers who will churn and who will not.

Methodology: It is a classification problem. We will use decision tree and random forest to predict the outcome.

Steps Involved

Read the data

Check for data types https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F1ffb600d8a4b4b36bc25e957524a3524%2FPicture1.png?generation=1688638600831386&alt=media" alt="">

Change character vector to factor vector as this is as classification problem

Drop the variable which is not significant for the analysis. We drop "customerID".

Check for missing values. None are found.

Split the data into train and test so we can use the train data for building the model and use test data for prediction. We split this into 80-20 ratio (train/test) using the sample function.

Install and run libraries (rpart, rpart.plot, rattle, RColorBrewer, caret)

Run decision tree using rpart function. The dependent variable is Churn and 19 other independent variables

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F8d3442e6c82d8026c6a448e4780ab38c%2FPicture2.png?generation=1688638685268853&alt=media" alt=""> 9. Plot the decision tree

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F9ab0591e323dc30fe116c79f6d014d06%2FPicture3.png?generation=1688638747644320&alt=media" alt="">

Average customer churn is 27%. The churn can take place if the tenure is more than >=7.5 and there is no internet service

Tuning the model

Define the search grid using the expand.grid function

Set up the control parameters through 5 fold cross validation

When we print the model we get the best CP = 0.01 and an accuracy of 79.00%

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F16080ac04d3743ec238227e1ef2c8269%2FPicture4.png?generation=1688639197455166&alt=media" alt="">

Predict the model

Find out the variables which are most and least significant. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F61beb4224e9351cfc772147c43800502%2FPicture5.png?generation=1688639468638950&alt=media" alt="">

Significant variables are Internet Service, Tenure and the least significant are Streaming Movies, Tech Support.

USE RANDOM FOREST

Run library(randomForest). Here we are using the default ntree (500) and mtry (p/3) where p is the number of independent variables. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fc27fe7e83f0b53b7e067371b69c7f4a7%2FPicture6.png?generation=1688640478682685&alt=media" alt="">

Through confusion matrix, accuracy is coming 79.27%. The accuracy is marginally higher than that of decision tree i.e 79.00%. The error rate is pretty low when predicting "No" and much higher when predicting "Yes".

Plot the model showing which variables reduce the gini impunity the most and least. Total charges and tenure reduce the gini impunity the most while phone service has the least impact.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fec25fc3ba74ab9cef1a81188209512b1%2FPicture7.png?generation=1688640726235724&alt=media" alt="">

Predict the model and create a new data frame showing the actuals vs predicted values

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F50aa40e5dd676c8285020fd2fe627bf1%2FPicture8.png?generation=1688640896763066&alt=media" alt="">

Plot the model so as to find out where the OOB (out of bag ) error stops decreasing or becoming constant. As we can see that the error stops decreasing between 100 to 200 trees. So we decide to take ntree = 200 when we tune the model.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F87211e1b218c595911fbe6ea2806e27a%2FPicture9.png?generation=1688641103367564&alt=media" alt="">

Tune the model mtry=2 has the lowest OOB error rate

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F6057af5bb0719b16f1a97a58c3d4aa1d%2FPicture10.png?generation=1688641391027971&alt=media" alt="">

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2Fc7045eba4ee298c58f1bd0230c24c00d%2FPicture11.png?generation=1688641605829830&alt=media" alt="">

Use random forest with mtry = 2 and ntree = 200

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10868729%2F01541eff1f9c6303591aa50dd707b5f5%2FPicture12.png?generation=1688641634979403&alt=media" alt="">

Through confusion matrix, accuracy is coming 79.71%. The accuracy is marginally higher than that of default (when ntree was 500 and mtry was 4) i.e 79.27% and of decision tree i.e 79.00%. The error rate is pretty low when predicting "No" and m...
Example Churn Data
kaggle.com
Updated Nov 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
George Hayduke (2021). Example Churn Data [Dataset]. https://www.kaggle.com/ban7002/example-churn-data/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 11, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
George Hayduke
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Basic telco churn dataset used to challenge students and academics

Content

Info Description
file churn_100k.csv
n_samples 101K
n_features 28
pct_missing 1%

suggested model features

numeric_features = ['monthly_minutes', 'customerServiceCalls', 'streaming_minutes', 'TotalBilled', 'PrevBalance', 'latePayments']

categorical_features = ['ip_address_asn', 'phone_area_code', 'customer_reg_date', 'email_domain', 'phoneModel', 'billing_city', 'billing_postal', 'billing_state', 'partner', 'PhoneService', 'MultipleLines', 'streamingPlan', 'mobileHotspot', 'wifiCallingText', 'OnlineBackup', 'device_protection', 'number_phones', 'contract_code', 'currency_code', 'maling_code', 'paperlessBilling', 'paymentMethod']

dataset performance

random sampling 70/15/15

Train AUC Score : 0.967279 Eval AUC Score : 0.958073 Test AUC Score : 0.946909

Inspiration

Fun and simple dataset to practice with.
Fruits-360 dataset
kaggle.com
paperswithcode.com
+1more
Updated Jun 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mihai Oltean (2025). Fruits-360 dataset [Dataset]. https://www.kaggle.com/datasets/moltean/fruits
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 7, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mihai Oltean
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

Version: 2025.06.07.0

Content

The following fruits, vegetables and nuts and are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).

Branches

The dataset has 5 major branches:

-The 100x100 branch, where all images have 100x100 pixels. See _fruits-360_100x100_ folder.

-The original-size branch, where all images are at their original (captured) size. See _fruits-360_original-size_ folder.

-The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See _fruits-360_dataset_meta_ folder.

-The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See _fruits-360_multi_ folder.

-The _3_body_problem_ branch where the Training and Test folders contain different (varieties of) the 3 fruits and vegetables (Apples, Cherries and Tomatoes). See _fruits-360_3-body-problem_ folder.

How to cite

Mihai Oltean, Fruits-360 dataset, 2017-

Dataset properties

For the 100x100 branch

Total number of images: 138704.

Training set size: 103993 images.

Test set size: 34711 images.

Number of classes: 206 (fruits, vegetables, nuts and seeds).

Image size: 100x100 pixels.

For the original-size branch

Total number of images: 58363.

Training set size: 29222 images.

Validation set size: 14614 images

Test set size: 14527 images.

Number of classes: 90 (fruits, vegetables, nuts and seeds).

Image size: various (original, captured, size) pixels.

For the 3-body-problem branch

Total number of images: 47033.

Training set size: 34800 images.

Test set size: 12233 images.

Number of classes: 3 (Apples, Cherries, Tomatoes).

Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.

Image size: 100x100 pixels.

For the meta branch

Number of classes: 26 (fruits, vegetables, nuts and seeds).

For the multi branch

Number of images: 150.

Filename format:

For the 100x100 branch

image_index_100.jpg (e.g. 31_100.jpg) or

r_image_index_100.jpg (e.g. r_31_100.jpg) or

r?_image_index_100.jpg (e.g. r2_31_100.jpg)

where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).

Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.

For the original-size branch

r?_image_index.jpg (e.g. r2_31.jpg)

where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.

The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.

For the multi branch

The file's name is the concatenation of the names of the fruits inside that picture.

Alternate download

The Fruits-360 dataset can be downloaded from:

Kaggle https://www.kaggle.com/moltean/fruits

GitHub https://github.com/fruits-360

How fruits were filmed

Fruits and vegetables were planted in the shaft of a low-speed motor (3 rpm) and a short movie of 20 seconds was recorded.

A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.

Behind the fruits, we placed a white sheet of paper as a background.

Here i...
ATIS Dataset Clean Resplit
kaggle.com
Updated Feb 8, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
kpe (2019). ATIS Dataset Clean Resplit [Dataset]. https://www.kaggle.com/siddhadev/atis-dataset-clean/tasks
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 8, 2019
Dataset provided by
Kagglehttp://kaggle.com/
Authors
kpe
Description
ATIS DataSet

The ATIS dataset is a standard benchmark dataset widely used as an intent classification and slot filling task.

Content

The data is imported from https://github.com/yvchen/JointSLU but is cleaned and resplitted by removing duplicated samples and uncommon labels (see the kaggle kernel atis-dataset-clean-re-split-kernel used for creating this dataset for additional detalis).

Acknowledgements

Thanks to Yun-Nung (Vivian) Chen for publishing the original dataset.

I have previously found a version of the ATIS dataset in the MS CNTK and have written a converter as a jupyter notebook kpe/notebooks to make the dataset easily accessible in python. However I much more like the simplicity of text format representation used by yvchen/JointSLU.
Iris Species Dataset and Database
kaggle.com
Updated May 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ghanshyam Saini (2025). Iris Species Dataset and Database [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/iris-species-dataset-and-database
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 15, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ghanshyam Saini
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Iris Flower Dataset

This is a classic and very widely used dataset in machine learning and statistics, often serving as a first dataset for classification problems. Introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems," it is a foundational resource for learning classification algorithms.

Overview:

The dataset contains measurements for 150 samples of iris flowers. Each sample belongs to one of three species of iris:

Iris setosa

Iris versicolor

Iris virginica

For each flower, four features were measured:

Sepal length (in cm)

Sepal width (in cm)

Petal length (in cm)

Petal width (in cm)

The goal is typically to build a model that can classify iris flowers into their correct species based on these four features.

File Structure:

The dataset is usually provided as a single CSV (Comma Separated Values) file, often named iris.csv or similar. This file typically contains the following columns:

sepal_length (cm): Numerical. The length of the sepal of the iris flower.

sepal_width (cm): Numerical. The width of the sepal of the iris flower.

petal_length (cm): Numerical. The length of the petal of the iris flower.

petal_width (cm): Numerical. The width of the petal of the iris flower.

species: Categorical. The species of the iris flower (either 'setosa', 'versicolor', or 'virginica'). This is the target variable for classification.

Content of the Data:

The dataset contains an equal number of samples (50) for each of the three iris species. The measurements of the sepal and petal dimensions vary between the species, allowing for their differentiation using machine learning models.

How to Use This Dataset:

Download the iris.csv file.

Load the data using libraries like Pandas in Python.

Explore the data through visualization and statistical analysis to understand the relationships between the features and the different species.

Build classification models (e.g., Logistic Regression, Support Vector Machines, Decision Trees, K-Nearest Neighbors) using the sepal and petal measurements as features and the 'species' column as the target variable.

Evaluate the performance of your model using appropriate metrics (e.g., accuracy, precision, recall, F1-score).

The dataset is small and well-behaved, making it excellent for learning and experimenting with various classification techniques.

Citation:

When using the Iris dataset, it is common to cite Ronald Fisher's original work:

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.

Data Contribution:

Thank you for providing this classic and fundamental dataset to the Kaggle community. The Iris dataset remains an invaluable resource for both beginners learning the basics of classification and experienced practitioners testing new algorithms. Its simplicity and clear class separation make it an ideal starting point for many data science projects.

If you find this dataset description helpful and the dataset itself useful for your learning or projects, please consider giving it an upvote after downloading. Your appreciation is valuable!
Listening-to-Earthquakes
kaggle.com
Updated Apr 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PenguinGUI (2025). Listening-to-Earthquakes [Dataset]. https://www.kaggle.com/datasets/penguingui/listening-to-earthquakes/versions/3
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 30, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
PenguinGUI
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Data Card: LANL Earthquake Prediction

1. Data Source and Description

Source: The data originates from the LANL Earthquake Prediction Kaggle competition, provided by Los Alamos National Laboratory.

Description: The dataset comprises continuous seismic acoustic data collected from laboratory experiments simulating earthquake conditions. The objective is to predict the "time to failure"—the time remaining until the next laboratory earthquake occurs—making this a regression task.

Licensing and Ethical Considerations: The data is publicly available under the competition’s terms of use on Kaggle. Ethically, any models or insights derived should be applied responsibly, considering the potential implications of earthquake prediction in real-world scenarios.

2. Data Preprocessing

Sliding-Window Approach: Given the large size and continuous nature of the seismic data (each segment contains 150,000 data points), training time series models like Temporal Convolutional Networks (TCN) or Long Short-Term Memory (LSTM) networks on full-length samples is computationally intensive. To address this, a sliding-window strategy was implemented:

Window Sizes: Multiple sizes were used—150,000, 15,000, and 1,500 data points—to capture patterns across different temporal scales.

Strides: Strides of 150,000, 15,000, 7,500, 1,500, and 750 were applied, creating overlapping or non-overlapping windows and resulting in five distinct processed datasets.

Data Segmentation: The continuous acoustic signals were segmented into fixed-length chunks of 150,000 data points each. For training data, the label assigned to each segment was the "time to failure" at its endpoint, aligning with the prediction task.

Data Storage: Extracted feature sequences were saved in NumPy’s compressed .npz format, ensuring efficient storage, accessibility, and consistency across training and testing phases.

3. Feature Extraction

Statistical Features: For each window, eleven statistical features were calculated to summarize the seismic signals:

Mean, standard deviation, minimum, maximum, median, skewness, and kurtosis.

Quantile-based features at 1%, 5%, 95%, and 99% to detect variations and potential anomalies linked to seismic events.

Advanced Techniques: Drawing from top competition solutions (e.g., the 26th place approach), consider enhancing your feature set with:

Matched Filtering: Identifies recurring patterns in the time series, which could signal precursors to earthquakes.

Hilbert Transform: Extracts the analytical envelope of the signal, highlighting peak behaviors and dynamic changes.

Multi-Scale Analysis: Using varied window sizes and strides enables the capture of both short-term fluctuations and longer-term trends, critical for understanding seismic signal dynamics.

4. Predictive Modeling

Model Selection: A diverse set of models was planned to leverage their unique strengths:

Random Forest: Offers a baseline and insights into feature importance.

Multi-Layer Perceptron (MLP): Provides a simple neural network baseline.

Temporal Convolutional Network (TCN): Excels at modeling temporal dependencies with computational efficiency.

Long Short-Term Memory (LSTM): Captures long-term sequential relationships, ideal for time series like seismic data.

Ensemble Potential: The competition’s winning team combined neural networks with gradient boosted decision trees (e.g., LightGBM or XGBoost), suggesting that integrating tree-based models could boost performance.

Training Considerations:

Use cross-validation strategies tailored to the data’s structure, such as "Leave One Earthquake Out," to ensure generalization (inspired by the 1st place writeup).

Tune hyperparameters carefully, especially for tree-based models, to optimize for the target metric.

5. Evaluation

Metric: The competition evaluates submissions using Mean Absolute Error (MAE), measuring the accuracy of predicted "time to failure" against actual values.

Validation: Robust validation is key. Consider splitting the data by earthquake cycles to mimic the test set’s structure and avoid overfitting.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Info	Description
file	churn_100k.csv
n_samples	101K
n_features	28
pct_missing	1%

Facebook

Twitter

Click to copy link

Link copied

Cite

Nibedita Sahu (2023). Classification Analysis Using Python [Dataset]. https://www.kaggle.com/datasets/nibeditasahu/classification-analysis-using-python

Classification Analysis Using Python

Exploring the Iris Dataset and Building a Decision Tree Classifier

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 3, 2023

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Nibedita Sahu

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

The Iris dataset is a classic and widely used dataset in machine learning for classification tasks. It consists of measurements of different iris flowers, including sepal length, sepal width, petal length, and petal width, along with their corresponding species. With a total of 150 samples, the dataset is balanced and serves as an excellent choice for understanding and implementing classification algorithms. This notebook explores the dataset, preprocesses the data, builds a decision tree classification model, and evaluates its performance, showcasing the effectiveness of decision trees in solving classification problems.

Clear search

Close search

Google apps

Main menu

Classification Analysis Using Python

Customer Churn - Decision Tree & Random Forest

USE RANDOM FOREST

Example Churn Data

Context

Content

suggested model features

dataset performance

random sampling 70/15/15

Inspiration

Fruits-360 dataset

Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

Version: 2025.06.07.0

Content

Branches

How to cite

Dataset properties

For the 100x100 branch

For the original-size branch

For the 3-body-problem branch

For the meta branch

For the multi branch

Filename format:

For the 100x100 branch

For the original-size branch

For the multi branch

Alternate download

How fruits were filmed

ATIS Dataset Clean Resplit

ATIS DataSet

Content

Acknowledgements

Iris Species Dataset and Database

Iris Flower Dataset

Listening-to-Earthquakes

Data Card: LANL Earthquake Prediction

1. Data Source and Description

2. Data Preprocessing

3. Feature Extraction

4. Predictive Modeling

5. Evaluation

Classification Analysis Using Python

Exploring the Iris Dataset and Building a Decision Tree Classifier