100+ datasets found

Ecommerce Dataset for Data Analysis
kaggle.com
zip
Updated Sep 19, 2024
Cite
Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
Explore at:
Available download formats: zip (2028853 bytes)
Dataset updated
Sep 19, 2024
Authors
Shrishti Manja
Description
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.

Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
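
As a first EDA step on a dataset like this, one might check that the amount columns agree with each other. The sketch below assumes that Net Amount equals Gross Amount minus Discount Amount (INR), which is how the column descriptions read. The header names and sample rows are hypothetical and may not match the real CSV.

```python
import csv
import io

# Hypothetical sample rows; the real CSV's header names and values may differ.
SAMPLE = """CID,TID,Discount Availed,Discount Amount (INR),Gross Amount,Net Amount
1001,5001,Yes,50.0,500.0,450.0
1002,5002,No,0.0,120.0,120.0
1003,5003,Yes,30.0,300.0,280.0
"""


def inconsistent_rows(rows, tolerance=0.01):
    """Return the TIDs where Net Amount differs from Gross Amount - Discount Amount."""
    bad = []
    for row in rows:
        gross = float(row["Gross Amount"])
        discount = float(row["Discount Amount (INR)"])
        net = float(row["Net Amount"])
        if abs((gross - discount) - net) > tolerance:
            bad.append(row["TID"])
    return bad


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = inconsistent_rows(rows)
print(flagged)
```

In the sample, only the third row breaks the assumed relationship, so it is the only transaction flagged for review.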

This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning.
Cleaning Code and Preprocessing
kaggle.com
zip
Updated Feb 7, 2024
Cite
MohameddAteff (2024). Cleaning Code and Preprocessing [Dataset]. https://www.kaggle.com/datasets/mohameddateff/cleaning-code-and-preprocessing
Explore at:
Available download formats: zip (70478 bytes)
Dataset updated
Feb 7, 2024
Authors
MohameddAteff
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by MohameddAteff

Released under MIT

Contents
Excel Data Cleaning - Montgomery Fleet Inventory
kaggle.com
zip
Updated Feb 9, 2025
Cite
Ibrahimryk (2025). Excel Data Cleaning - Montgomery Fleet Inventory [Dataset]. https://www.kaggle.com/datasets/ibrahimryk/excel-data-cleaning-montgomery-fleet-inventory/data
Explore at:
Available download formats: zip (10139 bytes)
Dataset updated
Feb 9, 2025
Authors
Ibrahimryk
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains a cleaned version of the Montgomery County Fleet Equipment Inventory.

✅ Data Cleaning Steps:
- Removed duplicate records
- Fixed spelling errors
- Merged department names using Flash Fill
- Removed unnecessary whitespace
- Converted CSV to Excel (.XLSX) format
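
The cleaning for this dataset was done in Excel. For readers who prefer scripting, two of the steps above (removing unnecessary whitespace and removing duplicate records) can be sketched in Python. This is an illustrative equivalent, not the author's workflow, and the sample department names are hypothetical.

```python
def clean_rows(rows):
    """Collapse whitespace in every cell, then drop exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # " ".join(cell.split()) trims the ends and collapses internal runs of spaces.
        normalized = tuple(" ".join(cell.split()) for cell in row)
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


# Hypothetical rows: the first two differ only by stray whitespace.
raw = [
    ("  Fire  Rescue ", "Truck"),
    ("Fire Rescue", "Truck"),
    ("Police", " Sedan"),
]
result = clean_rows(raw)
print(result)
```

Normalizing whitespace before deduplicating matters here: rows that differ only by stray spaces would otherwise survive as separate records.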

📂 Original Dataset Source: Montgomery County Public Dataset
S1 Data -
plos.figshare.com
zip
Updated Oct 11, 2023
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.s001
Dataset updated
Oct 11, 2023
Dataset provided by
PLOS: http://plos.org/
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analyzing customers’ characteristics and giving early warning of customer churn with machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services and save substantial operating costs. Data cleaning, oversampling, data standardization, and other preprocessing operations were performed in Python on a data set of 900,000 telecom customers' personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was proposed. These four models and four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)) were each used to analyze the customer churn data. The results show that the four models perform better in terms of recall, precision, F1 score, and other indicators, and that the RF-Adaboost dual-ensemble model performs best. The recall rates of BPNN, RF, Adaboost, and the RF-Adaboost dual-ensemble model on positive samples are 79%, 90%, 89%, and 93%, respectively; the precision rates are 97%, 99%, 98%, and 99%; and the F1 scores are 87%, 95%, 94%, and 96%. For the RF-Adaboost dual-ensemble model, the three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
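
The F1 score is the harmonic mean of precision and recall, so the reported figures can be sanity-checked against each other. The sketch below recomputes F1 from the rounded precision and recall values quoted in the description. Because those inputs are already rounded to whole percentages, the recomputed values are expected to match the reported F1 scores only to within roughly one percentage point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), taken from the rounded figures in the description.
reported = {
    "BPNN": (0.97, 0.79, 0.87),
    "RF": (0.99, 0.90, 0.95),
    "Adaboost": (0.98, 0.89, 0.94),
    "RF-Adaboost": (0.99, 0.93, 0.96),
}

for name, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{name}: recomputed F1 = {f1:.3f}, reported = {f1_reported:.2f}")
```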
Advanced Health Intelligence (AHI): Revolutionizing Healthcare through AI...
kappasignal.com
Updated Jan 15, 2024
Cite
KappaSignal (2024). Advanced Health Intelligence (AHI): Revolutionizing Healthcare through AI and Data Analytics? (Forecast) [Dataset]. https://www.kappasignal.com/2024/01/advanced-health-intelligence-ahi.html
Explore at:
Dataset updated
Jan 15, 2024
Dataset authored and provided by
KappaSignal
License
https://www.kappasignal.com/p/legal-disclaimer.html
Description
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

Advanced Health Intelligence (AHI): Revolutionizing Healthcare through AI and Data Analytics?

Financial data:

Historical daily stock prices (open, high, low, close, volume)

Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

Machine learning features:

Feature engineering based on financial data and technical indicators

Sentiment analysis data from social media and news articles

Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

Potential Applications:

Stock price prediction

Portfolio optimization

Algorithmic trading

Market sentiment analysis

Risk management

Use Cases:

Researchers investigating the effectiveness of machine learning in stock market prediction

Analysts developing quantitative trading Buy/Sell strategies

Individuals interested in building their own stock market prediction models

Students learning about machine learning and financial applications

Additional Notes:

The dataset may include different levels of granularity (e.g., daily, hourly)

Data cleaning and preprocessing are essential before model training

Regular updates are recommended to maintain the accuracy and relevance of the data
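
Among the technical indicators listed above, the simple moving average is the easiest to illustrate. The following is a minimal sketch that computes it over a list of closing prices. The price values are hypothetical and are not taken from the dataset.

```python
def simple_moving_average(prices, window):
    """Return the mean of each consecutive run of `window` prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            # Drop the price that just left the window.
            running -= prices[i - window]
        if i >= window - 1:
            averages.append(running / window)
    return averages


closes = [10.0, 11.0, 12.0, 13.0, 14.0]  # hypothetical daily closes
sma3 = simple_moving_average(closes, 3)
print(sma3)
```

The result has `len(prices) - window + 1` entries, since the first full window is only available once `window` prices have been seen.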
Digital game addiction data version 2
data.mendeley.com
kaggle.com
Updated Aug 6, 2024
Cite
Esra Kahya-Ozyirmidokuz (2024). Digital game addiction data version 2 [Dataset]. http://doi.org/10.17632/7z75yjs8zg.1
Explore at:
Unique identifier
https://doi.org/10.17632/7z75yjs8zg.1
Dataset updated
Aug 6, 2024
Authors
Esra Kahya-Ozyirmidokuz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We have revised the dataset, ensuring that it is thoroughly processed and ready for analysis. The attached second dataset has undergone comprehensive preprocessing algorithms. This preprocessing includes steps such as data cleaning, normalization, and feature extraction to enhance the quality and usability of the data. These steps are crucial to ensure that the dataset is free from inconsistencies, missing values, and irrelevant information, thereby improving the accuracy and reliability of the subsequent machine learning models.
Cdd Dataset
universe.roboflow.com
zip
Updated Sep 5, 2023
Cite
hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/model/3
Explore at:
Available download formats: zip
Dataset updated
Sep 5, 2023
Dataset authored and provided by
hakuna matata
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Cucumber Disease Detection Bounding Boxes
Description
Project Documentation: Cucumber Disease Detection

Title and Introduction
Title: Cucumber Disease Detection

Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

Problem Statement
Problem Definition: The research uses image analysis methods to automate the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early disease identification.

Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

Goals and Objectives:
- Develop a machine learning model to classify cucumber plant images into healthy and diseased categories.
- Achieve a high level of accuracy in disease detection.
- Provide a tool for farmers to detect diseases early and take appropriate action.

Data Collection and Preprocessing
Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and damaged specimens.

Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.

Data Preprocessing:
- Data cleaning to remove irrelevant or corrupted images.
- Handling missing values, if any, in the dataset.
- Removing outliers that may negatively impact model training.
- Data augmentation techniques applied to increase dataset diversity.

Exploratory Data Analysis (EDA)
The dataset was examined using visuals such as scatter plots and histograms, and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.

Methodology
Machine Learning Algorithms:

Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.

Train-Test Split:

The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

Model Development
The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters including learning rate, batch size, and optimizer were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.

Model Training
During training, the model was fed the prepared dataset over a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
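
The project documentation does not give its training code, so the following is only a sketch of how patience-based early stopping is commonly implemented. The function name, the `patience` parameter, and the validation-loss values are hypothetical. A real training loop would compute the losses from the model and save a checkpoint whenever the best loss improves.

```python
def early_stopping_epochs(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stopped_epoch = -1
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch  # a real loop would save a model checkpoint here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, stopped_epoch


losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]  # hypothetical validation losses
best, stopped = early_stopping_epochs(losses, patience=2)
print(best, stopped)
```

With these values, training stops after two epochs without improvement and never reaches the final, lower loss. That is the usual trade-off when choosing a small patience.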

Model Evaluation
Evaluation Metrics:

Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both training and test datasets.

Performance Discussion:

The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

Results and Discussion
Key project findings include model performance and disease detection precision. A comparison of the models employed shows the benefits and drawbacks of each. Challenges faced throughout the project and the methods used to solve them are also covered.

Conclusion
A recap of the project's key learnings. The project's importance to early disease detection in agriculture should be highlighted. Future enhancements and potential research directions are suggested.

References
Library: Pillow, Roboflow, YOLO, Sklearn, matplotlib
Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1

Code Repository
https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

Rafiur Rahman Rafit EWU 2018-3-60-111
Medical Clean Dataset
kaggle.com
zip
Updated Jul 6, 2025
Cite
Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
Explore at:
Available download formats: zip (1262 bytes)
Dataset updated
Jul 6, 2025
Authors
Aamir Shahzad
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

Handling missing values using statistical techniques such as median imputation and mode replacement

Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)

Removing duplicate entries to ensure data accuracy

Parsing and standardizing date fields

Creating new derived features such as age groups

Detecting and reviewing outliers based on IQR

Removing irrelevant or redundant columns
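
The description mentions Python and Pandas but does not include the pipeline itself. As a rough illustration of two of the steps above (median imputation and IQR-based outlier detection), here is a standard-library sketch. The `ages` values and the conventional 1.5 multiplier are assumptions, not details from the dataset.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = [34, None, 29, 41, 38, 120, None, 36]  # hypothetical column with gaps
filled = impute_median(ages)
outliers = iqr_outliers(filled)
print(filled, outliers)
```

The function only flags candidates and leaves the data unchanged, in line with the "detecting and reviewing" wording: whether a flagged value is an error or a genuine extreme is a judgment call.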

The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.
Quadrise Fuels: A Burning Question for (QED) Stock (Forecast)
kappasignal.com
Updated Sep 25, 2024
Cite
KappaSignal (2024). Quadrise Fuels: A Burning Question for (QED) Stock (Forecast) [Dataset]. https://www.kappasignal.com/2024/09/quadrise-fuels-burning-question-for-qed.html
Explore at:
Dataset updated
Sep 25, 2024
Dataset authored and provided by
KappaSignal
License
https://www.kappasignal.com/p/legal-disclaimer.html
Description
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

Quadrise Fuels: A Burning Question for (QED) Stock

Financial data:

Historical daily stock prices (open, high, low, close, volume)

Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

Machine learning features:

Feature engineering based on financial data and technical indicators

Sentiment analysis data from social media and news articles

Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

Potential Applications:

Stock price prediction

Portfolio optimization

Algorithmic trading

Market sentiment analysis

Risk management

Use Cases:

Researchers investigating the effectiveness of machine learning in stock market prediction

Analysts developing quantitative trading Buy/Sell strategies

Individuals interested in building their own stock market prediction models

Students learning about machine learning and financial applications

Additional Notes:

The dataset may include different levels of granularity (e.g., daily, hourly)

Data cleaning and preprocessing are essential before model training

Regular updates are recommended to maintain the accuracy and relevance of the data
Oncolytics Biotech Inc. Forecast & Analysis (Forecast)
kappasignal.com
Updated Aug 16, 2023
+ more versions
Cite
KappaSignal (2023). Oncolytics Biotech Inc. Forecast & Analysis (Forecast) [Dataset]. https://www.kappasignal.com/2023/08/oncolytics-biotech-inc-forecast-analysis.html
Explore at:
Dataset updated
Aug 16, 2023
Dataset authored and provided by
KappaSignal
License
https://www.kappasignal.com/p/legal-disclaimer.html
Description
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

Oncolytics Biotech Inc. Forecast & Analysis

Financial data:

Historical daily stock prices (open, high, low, close, volume)

Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

Machine learning features:

Feature engineering based on financial data and technical indicators

Sentiment analysis data from social media and news articles

Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

Potential Applications:

Stock price prediction

Portfolio optimization

Algorithmic trading

Market sentiment analysis

Risk management

Use Cases:

Researchers investigating the effectiveness of machine learning in stock market prediction

Analysts developing quantitative trading Buy/Sell strategies

Individuals interested in building their own stock market prediction models

Students learning about machine learning and financial applications

Additional Notes:

The dataset may include different levels of granularity (e.g., daily, hourly)

Data cleaning and preprocessing are essential before model training

Regular updates are recommended to maintain the accuracy and relevance of the data
Probabilistic AI: A New Approach to Artificial Intelligence (Forecast)
kappasignal.com
Updated May 27, 2023
+ more versions
Cite
KappaSignal (2023). Probabilistic AI: A New Approach to Artificial Intelligence (Forecast) [Dataset]. https://www.kappasignal.com/2023/05/probabilistic-ai-new-approach-to.html
Explore at:
Dataset updated
May 27, 2023
Dataset authored and provided by
KappaSignal
License
https://www.kappasignal.com/p/legal-disclaimer.html
Description
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

Probabilistic AI: A New Approach to Artificial Intelligence

Financial data:

Historical daily stock prices (open, high, low, close, volume)

Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

Machine learning features:

Feature engineering based on financial data and technical indicators

Sentiment analysis data from social media and news articles

Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

Potential Applications:

Stock price prediction

Portfolio optimization

Algorithmic trading

Market sentiment analysis

Risk management

Use Cases:

Researchers investigating the effectiveness of machine learning in stock market prediction

Analysts developing quantitative trading Buy/Sell strategies

Individuals interested in building their own stock market prediction models

Students learning about machine learning and financial applications

Additional Notes:

The dataset may include different levels of granularity (e.g., daily, hourly)

Data cleaning and preprocessing are essential before model training

Regular updates are recommended to maintain the accuracy and relevance of the data
The desensitized dataset of online comments about the autonomous vehicle...
scidb.cn
Updated Jul 10, 2025
Cite
Jiang jilin (2025). The desensitized dataset of online comments about the autonomous vehicle "Apollo Go" [Dataset]. http://doi.org/10.57760/sciencedb.27758
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.27758
Dataset updated
Jul 10, 2025
Dataset provided by
Science Data Bank
Authors
Jiang jilin
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Description
This study systematically collected user comments related to the topic "Apollo Go" on the Douyin platform using Python-based automated web scraping technology. By developing efficient scraping scripts, a large volume of user interaction data was automatically gathered. After rigorous data cleaning and preprocessing, a dataset containing 5,985 valid comments was constructed.

During the data cleaning process, all personally identifiable information was anonymized to ensure compliance and data security. Sensitive fields such as usernames and geographic locations were removed. The final dataset retains the following two fields:

- Time: Records the exact timestamp when each comment was posted, formatted as "2024/7/13 20:42:55", accurate to the second, facilitating subsequent time-series analysis.
- Comment: Contains the original user-generated text, preserved in its raw form, suitable for natural language processing tasks such as sentiment analysis and topic modeling.

This dataset is well-structured and authentic, making it suitable for various applications including social media public opinion analysis, public sentiment monitoring, and research on topic dissemination pathways.
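
The timestamp format quoted in the description has no zero padding on the month, which Python's `datetime.strptime` accepts with the usual `%m` and `%d` directives. The sketch below parses the Time field and counts comments per day as a starting point for time-series analysis. Only the first timestamp comes from the description. The other records and the dictionary layout are hypothetical.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_time(text):
    """Parse a timestamp such as '2024/7/13 20:42:55'."""
    return datetime.strptime(text, TIME_FORMAT)


# Hypothetical records; only the first timestamp is quoted from the description.
records = [
    {"Time": "2024/7/13 20:42:55", "Comment": "example comment A"},
    {"Time": "2024/7/13 21:05:10", "Comment": "example comment B"},
    {"Time": "2024/7/14 08:15:30", "Comment": "example comment C"},
]

per_day = Counter(parse_time(r["Time"]).date().isoformat() for r in records)
print(per_day)
```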
Can we predict stock market using machine learning? (FZO Stock Forecast)...
kappasignal.com
Updated Nov 21, 2022
+ more versions
Cite
KappaSignal (2022). Can we predict stock market using machine learning? (FZO Stock Forecast) (Forecast) [Dataset]. https://www.kappasignal.com/2022/11/can-we-predict-stock-market-using_20.html
Explore at:
Dataset updated
Nov 21, 2022
Dataset authored and provided by
KappaSignal
License
https://www.kappasignal.com/p/legal-disclaimer.html
Description
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

Can we predict stock market using machine learning? (FZO Stock Forecast)

Financial data:

Historical daily stock prices (open, high, low, close, volume)

Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

Machine learning features:

Feature engineering based on financial data and technical indicators

Sentiment analysis data from social media and news articles

Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

Potential Applications:

Stock price prediction

Portfolio optimization

Algorithmic trading

Market sentiment analysis

Risk management

Use Cases:

Researchers investigating the effectiveness of machine learning in stock market prediction

Analysts developing quantitative trading Buy/Sell strategies

Individuals interested in building their own stock market prediction models

Students learning about machine learning and financial applications

Additional Notes:

The dataset may include different levels of granularity (e.g., daily, hourly)

Data cleaning and preprocessing are essential before model training

Regular updates are recommended to maintain the accuracy and relevance of the data
Confusion matrix.
plos.figshare.com
xls
Updated Oct 11, 2023
+ more versions
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0292466.t005
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.t005
Dataset updated
Oct 11, 2023
Dataset provided by
PLOS: http://plos.org/
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analyzing customers’ characteristics and giving early warning of customer churn with machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services and save substantial operating costs. Data cleaning, oversampling, data standardization, and other preprocessing operations were performed in Python on a data set of 900,000 telecom customers' personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was proposed. These four models and four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)) were each used to analyze the customer churn data. The results show that the four models perform better in terms of recall, precision, F1 score, and other indicators, and that the RF-Adaboost dual-ensemble model performs best. The recall rates of BPNN, RF, Adaboost, and the RF-Adaboost dual-ensemble model on positive samples are 79%, 90%, 89%, and 93%, respectively; the precision rates are 97%, 99%, 98%, and 99%; and the F1 scores are 87%, 95%, 94%, and 96%. For the RF-Adaboost dual-ensemble model, the three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
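
Since this entry is a confusion matrix, it may help to recall how precision, recall, F1, and accuracy on the positive (churn) class are derived from the four cells of a binary confusion matrix. The counts below are hypothetical and are not the values in the published table.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute positive-class metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fp=10, fn=30, tn=870)
print(metrics)
```

With imbalanced classes like these, accuracy stays high even though a quarter of the positives are missed, which is why churn studies usually report recall and precision on the positive class.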
Predicting Ventilator-Associated Pneumonia in ICU Patients with Type 2...
figshare.com
docx
Updated Nov 1, 2025
Cite
Shi Aoxing (2025). Predicting Ventilator-Associated Pneumonia in ICU Patients with Type 2 Diabetes — Data Preprocessing, Baseline Features, Correlation Analysis, Model Evaluation, and the TRIPOD-AI Guideline [Dataset]. http://doi.org/10.6084/m9.figshare.30454706.v6
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.6084/m9.figshare.30454706.v6
Dataset updated
Nov 1, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Shi Aoxing
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains the baseline characteristics and supplementary data from a study of ICU patients with type 2 diabetes mellitus (T2DM), aiming to predict ventilator-associated pneumonia (VAP) using machine learning. The baseline characteristics table summarizes patient demographics, vital signs, and laboratory measurements. Supplementary figures illustrate the data preprocessing steps (histograms and boxplots before and after interquartile range cleaning), missing value imputation using the Random Forest method, variable correlation analysis (Spearman correlation heatmap), and model evaluation (confusion matrices of four predictive models). In addition, the dataset includes a file summarizing the TRIPOD-AI guideline used for model reporting. These data provide a detailed overview of feature selection, data cleaning procedures, and model performance assessment.
Fig. S1. Histograms and boxplots of Glucose_max and SBP_max in original and cleaned datasets: Glusco_max, maximum blood glucose; SBP_max, maximum systolic blood pressure. (A) original Glusco_max; (B) cleaned Glusco_max; (C) original SBP_max; (D) cleaned SBP_max.
Fig. S2. Histograms and boxplots of Temp_min and WBC_min in original and cleaned datasets: Temp_min, minimum body temperature; WBC_min, minimum white blood cell count. (A) original Temp_min; (B) cleaned Temp_min; (C) original WBC_min; (D) cleaned WBC_min.
Fig. S3. Histograms of PH_max and PH_min in original and Random Forest–imputed datasets: PH_max, maximum pH; PH_min, minimum pH.
Fig. S4. Histograms of PO2_max and PO2_min in original and Random Forest–imputed datasets: PO2_max, maximum partial pressure of oxygen; PO2_min, minimum partial pressure of oxygen.
Fig. S5. Histograms of PT_max and PT_min in original and Random Forest–imputed datasets: PT_max, maximum prothrombin time; PT_min, minimum prothrombin time.
Fig. S6. Spearman correlation heatmap of variables selected by both the Boruta algorithm and LASSO regression: Hypertension, history of hypertension; Temp_min, minimum body temperature; Glusco_max, maximum blood glucose; Scr_max, maximum serum creatinine; WBC_min, minimum white blood cell count; CNS, SOFA neurological subscore; Renal, SOFA renal subscore; and GCS, Glasgow Coma Scale.
Fig. S7. Confusion matrices of four predictive models: (A) Logistic Regression, (B) Random Forest, (C) XGBoost, and (D) Gradient Boosting Machine (GBM). Each matrix presents the counts of true positives, true negatives, false positives, and false negatives, facilitating model performance comparison.
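The description mentions interquartile range (IQR) cleaning without giving the exact rule. The sketch below shows one common form, assuming the conventional 1.5 × IQR fences and inclusive quartiles; the multiplier, the function name, and the sample values are illustrative assumptions and are not taken from the study.

```python
import statistics


def iqr_clean(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the conventional fence width, assumed here; the study's
    actual threshold is not stated in the description.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Made-up maximum glucose readings with one implausible entry.
glucose_max = [90, 100, 110, 120, 130, 500]
cleaned = iqr_clean(glucose_max)

if __name__ == "__main__":
    print(cleaned)
```

Comparing a histogram of `glucose_max` with one of `cleaned` would give the kind of before-and-after view described for Figs. S1 and S2.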
How accurate is machine learning in stock market? (SVT Stock Forecast)...
kappasignal.com
Updated Nov 16, 2022
Cite
KappaSignal (2022). How accurate is machine learning in stock market? (SVT Stock Forecast) (Forecast) [Dataset]. https://www.kappasignal.com/2022/11/how-accurate-is-machine-learning-in_45.html
Dataset updated
Nov 16, 2022
Dataset authored and provided by
KappaSignal
License
https://www.kappasignal.com/p/legal-disclaimer.html
Description
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

How accurate is machine learning in stock market? (SVT Stock Forecast)

Financial data:

Historical daily stock prices (open, high, low, close, volume)

Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

Machine learning features:

Feature engineering based on financial data and technical indicators

Sentiment analysis data from social media and news articles

Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

Potential Applications:

Stock price prediction

Portfolio optimization

Algorithmic trading

Market sentiment analysis

Risk management

Use Cases:

Researchers investigating the effectiveness of machine learning in stock market prediction

Analysts developing quantitative trading Buy/Sell strategies

Individuals interested in building their own stock market prediction models

Students learning about machine learning and financial applications

Additional Notes:

The dataset may include different levels of granularity (e.g., daily, hourly)

Data cleaning and preprocessing are essential before model training

Regular updates are recommended to maintain the accuracy and relevance of the data
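Moving averages are the first technical indicator listed above. As a minimal sketch of how such a feature could be derived from daily closing prices, the function below computes a simple moving average with a running sum; the function name, the window length, and the sample prices are illustrative assumptions and do not come from the KappaSignal data.

```python
def simple_moving_average(prices, window):
    """Return the simple moving average of `prices` over `window` points.

    The result has len(prices) - window + 1 entries, one per full window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(prices) < window:
        return []
    running = sum(prices[:window])
    averages = [running / window]
    for i in range(window, len(prices)):
        # Slide the window: add the newest price, drop the oldest.
        running += prices[i] - prices[i - window]
        averages.append(running / window)
    return averages


# Made-up closing prices.
closes = [1, 2, 3, 4, 5]
sma_3 = simple_moving_average(closes, 3)

if __name__ == "__main__":
    print(sma_3)
```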
How accurate is machine learning in stock market? (LON:REL Stock Forecast)...
kappasignal.com
Updated Sep 18, 2022
Cite
KappaSignal (2022). How accurate is machine learning in stock market? (LON:REL Stock Forecast) (Forecast) [Dataset]. https://www.kappasignal.com/2022/09/how-accurate-is-machine-learning-in_79.html
Dataset updated
Sep 18, 2022
Dataset authored and provided by
KappaSignal
License
https://www.kappasignal.com/p/legal-disclaimer.html
Description
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

How accurate is machine learning in stock market? (LON:REL Stock Forecast)

Financial data:

Historical daily stock prices (open, high, low, close, volume)

Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

Machine learning features:

Feature engineering based on financial data and technical indicators

Sentiment analysis data from social media and news articles

Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

Potential Applications:

Stock price prediction

Portfolio optimization

Algorithmic trading

Market sentiment analysis

Risk management

Use Cases:

Researchers investigating the effectiveness of machine learning in stock market prediction

Analysts developing quantitative trading Buy/Sell strategies

Individuals interested in building their own stock market prediction models

Students learning about machine learning and financial applications

Additional Notes:

The dataset may include different levels of granularity (e.g., daily, hourly)

Data cleaning and preprocessing are essential before model training

Regular updates are recommended to maintain the accuracy and relevance of the data
Synthetic Stroke Prediction Dataset
data.mendeley.com
kaggle.com
Updated May 2, 2025
Cite
Mohammed Borhan Uddin (2025). Synthetic Stroke Prediction Dataset [Dataset]. http://doi.org/10.17632/s2nh6fm925.1
Unique identifier
https://doi.org/10.17632/s2nh6fm925.1
Dataset updated
May 2, 2025
Authors
Mohammed Borhan Uddin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is a synthetic version inspired by the original "Stroke Prediction Dataset" on Kaggle. It contains anonymized, artificially generated data intended for research and model training on healthcare-related stroke prediction. The dataset generated using GPT-4o contains 50,000 records and 12 features. The target variable is stroke, a binary classification where 1 represents stroke occurrence and 0 represents no stroke. The dataset includes both numerical and categorical features, requiring preprocessing steps before analysis. A small portion of the entries includes intentionally introduced missing values to allow users to practice various data preprocessing techniques such as imputation, missing data analysis, and cleaning. The dataset is suitable for educational and research purposes, particularly in machine learning tasks related to classification, healthcare analytics, and data cleaning. No real-world patient information was used in creating this dataset.
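The description says missing values were introduced so that users can practice imputation. As a minimal sketch of one such technique, the snippet below measures the share of missing entries in a numeric column and fills them with the mean of the observed values. The function names and sample values are illustrative assumptions; the description does not say which features contain gaps, and mean imputation is only one of several reasonable choices.

```python
from statistics import mean


def missing_fraction(column):
    """Share of entries in `column` that are None."""
    return sum(v is None for v in column) / len(column)


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in column]


# Made-up numeric feature with one missing entry.
feature = [1.0, None, 3.0]
imputed = impute_mean(feature)

if __name__ == "__main__":
    print(missing_fraction(feature), imputed)
```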
IMNN Stock Forecast (Forecast)
kappasignal.com
Updated May 6, 2025
Cite
KappaSignal (2025). IMNN Stock Forecast (Forecast) [Dataset]. https://www.kappasignal.com/2025/05/imnn-stock-forecast.html
Dataset updated
May 6, 2025
Dataset authored and provided by
KappaSignal
License
https://www.kappasignal.com/p/legal-disclaimer.html
Description
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.

IMNN Stock Forecast

Financial data:

Historical daily stock prices (open, high, low, close, volume)

Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)

Technical indicators (e.g., moving averages, RSI, MACD, average directional index, aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution A/D line, parabolic SAR indicator, bollinger bands indicators, fibonacci, williams percent range, commodity channel index)

Machine learning features:

Feature engineering based on financial data and technical indicators

Sentiment analysis data from social media and news articles

Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)

Potential Applications:

Stock price prediction

Portfolio optimization

Algorithmic trading

Market sentiment analysis

Risk management

Use Cases:

Researchers investigating the effectiveness of machine learning in stock market prediction

Analysts developing quantitative trading Buy/Sell strategies

Individuals interested in building their own stock market prediction models

Students learning about machine learning and financial applications

Additional Notes:

The dataset may include different levels of granularity (e.g., daily, hourly)

Data cleaning and preprocessing are essential before model training

Regular updates are recommended to maintain the accuracy and relevance of the data
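RSI appears in the technical indicators listed above. The sketch below computes a simple-average variant of it over the most recent price changes; it is not Wilder's smoothed version, and the function name, the default period, the all-gains convention, and the sample prices are illustrative assumptions and do not come from the KappaSignal data.

```python
def rsi_simple(closes, period=14):
    """Relative Strength Index from plain averages of gains and losses.

    Uses the last `period` price changes. Returns 100.0 when there are
    no losses in the window (a convention assumed here).
    """
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    recent = closes[-(period + 1):]
    changes = [later - earlier for earlier, later in zip(recent, recent[1:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0
    relative_strength = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + relative_strength)


# Made-up closing prices that alternate up and down by the same amount.
balanced = rsi_simple([10, 11, 10, 11, 10], period=4)

if __name__ == "__main__":
    print(balanced)
```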
ML-ready Curie temperatures and descriptors extracted from the JuHemd...
archive.materialscloud.org
application/gzip +1
Updated Dec 19, 2022
Cite
Robin Hilgers; Daniel Wortmann; Stefan Blügel (2022). ML-ready Curie temperatures and descriptors extracted from the JuHemd database [Dataset]. http://doi.org/10.24435/materialscloud:w1-yf
Explore at:
Available download formats: application/gzip, text/markdown
Unique identifier
https://doi.org/10.24435/materialscloud:w1-yf
Dataset updated
Dec 19, 2022
Dataset provided by
Materials Cloud
Authors
Robin Hilgers; Daniel Wortmann; Stefan Blügel
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The uploaded archive provides an ML-ready data set extracted from the JuHemd database (see references), augmented with supplemental data for atomic descriptors. Descriptors provided in this data set include structural, magnetic, and atomic quantities as well as derived (summed) quantities. In total, 118 possible descriptors are included, of which 12 are DFT generated. For each simulation type (LDA/GGA), a version of the data set with the DFT data removed is also available.

After data cleaning and preprocessing, we extracted 387 LDA-calculated magnetic Heusler structures as well as 408 GGA structures that have a full structural and magnetic data set. As we only aim at magnetic compounds, we selected from the original JuHemd only those compounds that have a total absolute magnetic moment of at least 0.1 Bohr magneton. For each data file, there is a descriptor file naming all the descriptors included in the data set.
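The magnetic-moment selection described above can be sketched as a threshold filter. This reads the description as keeping compounds whose total absolute magnetic moment is at least 0.1 Bohr magneton. The record layout, the key names `compound` and `total_abs_moment`, and the sample values are hypothetical and are not taken from the archive's files.

```python
MOMENT_THRESHOLD = 0.1  # Bohr magneton, from the description


def keep_magnetic(records, threshold=MOMENT_THRESHOLD):
    """Keep records whose total absolute magnetic moment meets the threshold."""
    return [r for r in records if r["total_abs_moment"] >= threshold]


# Hypothetical records; names and values are made up for illustration.
records = [
    {"compound": "X2YZ-a", "total_abs_moment": 2.0},
    {"compound": "X2YZ-b", "total_abs_moment": 0.05},
    {"compound": "X2YZ-c", "total_abs_moment": 0.1},
]
magnetic = keep_magnetic(records)

if __name__ == "__main__":
    print([r["compound"] for r in magnetic])
```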

Cite

Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code

Ecommerce Dataset for Data Analysis

Exploratory Data Analysis, Data Visualisation and Machine Learning

Explore at:

Available download formats: zip (2028853 bytes)

Dataset updated

Sep 19, 2024

Authors

Shrishti Manja

Description

This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

About the Dataset: - CID (Customer ID): A unique identifier for each customer. - TID (Transaction ID): A unique identifier for each transaction. - Gender: The gender of the customer, categorized as Male or Female. - Age Group: Age group of the customer, divided into several ranges. - Purchase Date: The timestamp of when the transaction took place. - Product Category: The category of the product purchased, such as Electronics, Apparel, etc. - Discount Availed: Indicates whether the customer availed any discount (Yes/No). - Discount Name: Name of the discount applied (e.g., FESTIVE50). - Discount Amount (INR): The amount of discount availed by the customer. - Gross Amount: The total amount before applying any discount. - Net Amount: The final amount after applying the discount. - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.). - Location: The city where the purchase took place.

Use Cases: 1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data. 2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis. 3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts. 4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning
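One simple cleaning check suggested by the column descriptions is whether Net Amount equals Gross Amount minus Discount Amount (INR). The sketch below flags transactions where that does not hold. The relationship is an assumption inferred from the field descriptions, and the function name, the tolerance, and the sample rows are made up for illustration.

```python
def inconsistent_transactions(rows, tolerance=0.01):
    """Return the TIDs of rows where net != gross - discount (within tolerance)."""
    flagged = []
    for row in rows:
        expected_net = row["Gross Amount"] - row["Discount Amount (INR)"]
        if abs(row["Net Amount"] - expected_net) > tolerance:
            flagged.append(row["TID"])
    return flagged


# Made-up rows using the column names from the dataset description.
rows = [
    {"TID": "T1", "Gross Amount": 1000.0, "Discount Amount (INR)": 50.0, "Net Amount": 950.0},
    {"TID": "T2", "Gross Amount": 500.0, "Discount Amount (INR)": 0.0, "Net Amount": 480.0},
]
flagged = inconsistent_transactions(rows)

if __name__ == "__main__":
    print(flagged)
```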
