80 datasets found
  1. Orange dataset table

    • figshare.com
    xlsx
    Updated Mar 4, 2022
    Cite
    Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Mar 4, 2022
    Dataset provided by
    figshare
    Authors
    Rui Simões
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, MitoTracker Red CMXRos area and intensity (3 h and 24 h incubations with both compounds), MitoSOX oxidation (3 h incubation with the aforementioned compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA, and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and was standardized across features. The small number of samples precluded a full, robust statistical analysis of the results; nevertheless, it allowed the identification of relevant hidden patterns and trends.

    Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) to instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when the majority reaches 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure, and area under the ROC curve (AUC) metrics.
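    The gain-ratio criterion named above can be illustrated with a short stand-alone computation. This is a generic sketch of the measure, not the Orange implementation, and the toy labels are invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, partitions):
    """Information gain of a split, normalized by the split's intrinsic
    information. `partitions` is one list of labels per branch."""
    n = len(parent_labels)
    weights = [len(p) / n for p in partitions]
    info_gain = entropy(parent_labels) - sum(
        w * entropy(p) for w, p in zip(weights, partitions))
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    return info_gain / split_info if split_info > 0 else 0.0

# A perfectly separating binary split of a balanced two-class sample
# (invented labels, echoing the 4-samples-per-class design):
labels = ["control"] * 4 + ["treated"] * 4
print(gain_ratio(labels, [labels[:4], labels[4:]]))  # → 1.0
```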

  2. Data from: The Often-Overlooked Power of Summary Statistics in Exploratory...

    • acs.figshare.com
    xlsx
    Updated Jun 8, 2023
    Cite
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford (2023). The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE) [Dataset]. http://doi.org/10.1021/acs.jcim.1c00244.s002
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
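    Several of the summary statistics compared in this work (mean, STD, 1-norm, range, SSQ) reduce each spectrum to a single number and are easy to sketch. The PRE line below, the Shannon entropy of the intensities normalized to unit sum, is only one plausible formulation assumed here, not taken from the paper:

```python
import math

def summary_stats(spectrum):
    """Reduce one spectrum (a list of intensities) to a dict of summary statistics."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    stats = {
        "mean": mean,
        "std": math.sqrt(sum((x - mean) ** 2 for x in spectrum) / n),
        "1-norm": sum(abs(x) for x in spectrum),
        "range": max(spectrum) - min(spectrum),
        "ssq": sum(x ** 2 for x in spectrum),
    }
    # PRE, assumed here to be the Shannon entropy of the intensities
    # normalized to unit sum (one plausible reading of the name):
    total = stats["1-norm"]
    probs = [abs(x) / total for x in spectrum]
    stats["pre"] = -sum(p * math.log2(p) for p in probs if p > 0)
    return stats

s = summary_stats([1.0, 2.0, 3.0, 2.0])
print(s["mean"], s["range"], s["ssq"])  # → 2.0 2.0 18.0
```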

  3. Exploratory Data Analysis (EDA) Tools Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54257
    Explore at:
    ppt, doc, pdf (available download formats)
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing need for businesses to derive actionable insights from their ever-expanding datasets. The market, currently estimated at $15 billion in 2025, is projected to grow at a compound annual growth rate (CAGR) of 15% from 2025 to 2033, reaching an estimated $45 billion by 2033.

    This growth is fueled by several factors: the rising adoption of big data analytics, the proliferation of cloud-based solutions offering enhanced accessibility and scalability, and the growing demand for data-driven decision-making across industries such as finance, healthcare, and retail. The market is segmented by application (large enterprises and SMEs) and type (graphical and non-graphical tools). Graphical tools currently hold the larger market share thanks to their user-friendly interfaces and ability to communicate complex data patterns effectively. Large enterprises are currently the dominant segment, but the SME segment is anticipated to grow faster as EDA solutions become more affordable and accessible.

    Geographic expansion is another key driver: North America currently holds the largest market share due to early adoption and a strong technological ecosystem, while regions like Asia-Pacific exhibit high growth potential, fueled by rapid digitalization and a burgeoning data science talent pool. Despite these opportunities, the market faces certain restraints, including the complexity of some EDA tools, which require specialized skills, and the challenge of integrating EDA tools with existing business intelligence platforms. Nonetheless, the overall market outlook remains highly positive, driven by ongoing technological advancements and the increasing importance of data analytics across all sectors. Competition among established players like IBM Cognos Analytics and Altair RapidMiner and emerging innovators like Polymer Search and KNIME further fuels market dynamism and innovation.

  4. Solar Panel Eda Dataset

    • universe.roboflow.com
    zip
    Updated Aug 29, 2024
    Cite
    Ramkumar (2024). Solar Panel Eda Dataset [Dataset]. https://universe.roboflow.com/ramkumar/solar-panel-eda
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 29, 2024
    Dataset authored and provided by
    Ramkumar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Solar Panel Bounding Boxes
    Description

    Solar Panel EDA

    ## Overview
    
    Solar Panel EDA is a dataset for object detection tasks - it contains Solar Panel annotations for 721 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  5. Enbor Eda Dataset

    • universe.roboflow.com
    zip
    Updated Sep 27, 2024
    Cite
    2014 Series License Plate (2024). Enbor Eda Dataset [Dataset]. https://universe.roboflow.com/2014-series-license-plate/enbor-eda
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 27, 2024
    Dataset authored and provided by
    2014 Series License Plate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    License Plate MgLz Licenseplate Bounding Boxes
    Description

    Enbor Eda

    ## Overview
    
    Enbor Eda is a dataset for object detection tasks - it contains License Plate MgLz Licenseplate annotations for 3,120 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. DataSheet1_Exploratory data analysis (EDA) machine learning approaches for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001
    Explore at:
    docx (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. 
Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
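    The PCA step described above, projecting high-dimensional spectra into a lower-dimensional space before clustering, can be sketched with a plain eigendecomposition of the feature covariance matrix. This is a generic illustration on random data, not the authors' pipeline:

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top principal components.
    Returns (scores, explained-variance ratio of the kept components)."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order], eigvals[order] / eigvals.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))   # 30 synthetic "spectra" with 8 features each
scores, ratio = pca(X)
print(scores.shape)  # → (30, 2)
```

The scores returned here are what a clustering algorithm would then group into data-driven clusters.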

  7. EDA augmentation parameters.

    • plos.figshare.com
    xls
    Updated Sep 26, 2024
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). EDA augmentation parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t009
    Explore at:
    xls (available download formats)
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
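    Easy Data Augmentation (EDA), one of the techniques evaluated above, produces label-preserving variants of a sentence with simple word-level operations. Below is a minimal sketch of two of its four operations (random swap and random deletion), illustrative only and without the synonym resources the full method uses:

```python
import random

def random_swap(words, n_swaps=1, rng=None):
    """Return a copy of `words` with n random position swaps applied."""
    rng = rng or random.Random(0)
    words = words[:]
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.2, rng=None):
    """Drop each word with probability p, keeping at least one word."""
    rng = rng or random.Random(0)
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]

sentence = "data augmentation expands small labeled datasets".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```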

  8. Eda_all Dataset

    • universe.roboflow.com
    zip
    Updated May 24, 2024
    Cite
    cropperyash (2024). Eda_all Dataset [Dataset]. https://universe.roboflow.com/cropperyash/eda_all
    Explore at:
    zip (available download formats)
    Dataset updated
    May 24, 2024
    Dataset authored and provided by
    cropperyash
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    All Polygons
    Description

    Eda_all

    ## Overview
    
    Eda_all is a dataset for instance segmentation tasks - it contains All annotations for 1,314 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  9. Data: Anscombe's quintet

    • kaggle.com
    Updated Apr 17, 2025
    Cite
    Carl McBride Ellis (2025). Data: Anscombe's quintet [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/data-anscombes-quartet/data
    Explore at:
    Croissant (available download formats). Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This file is the data set from the famous publication Francis J. Anscombe, "*Graphs in Statistical Analysis*", The American Statistician 27, pp. 17-21 (1973) (doi: 10.1080/00031305.1973.10478966). It consists of four data sets of 11 points each. Note the peculiarity that the same 'x' values are used for the first three data sets; I have followed the original publication exactly (this was originally done to save space), i.e. the first column (x123) serves as the 'x' for the next three 'y' columns: y1, y2 and y3.

    In the dataset Anscombe_quintet_data.csv there is a new column (y5) as an example of Simpson's paradox (C. McBride Ellis, "*Anscombe dataset No. 5: Simpson's paradox*", Zenodo, doi: 10.5281/zenodo.15209087 (2025)).
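    Anscombe's point, that sets with very different shapes share near-identical summary statistics, is easy to verify with the well-known values of the shared x column and the first two y columns:

```python
from statistics import mean, variance

# The shared x column and the first two y columns of Anscombe's quartet.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(a, b):
    """Sample Pearson correlation coefficient."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / (sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b)) ** 0.5

for y in (y1, y2):
    print(round(mean(y), 2), round(variance(y), 2), round(pearson(x, y), 3))
# Each set gives mean ≈ 7.5, variance ≈ 4.1, and correlation ≈ 0.816
```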

  10. Enbor Eda Article 2 V4 Dataset

    • universe.roboflow.com
    zip
    Updated Sep 29, 2024
    Cite
    2014 Series License Plate (2024). Enbor Eda Article 2 V4 Dataset [Dataset]. https://universe.roboflow.com/2014-series-license-plate/enbor-eda-article-2-v4
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 29, 2024
    Dataset authored and provided by
    2014 Series License Plate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Plate AQOx Plate XoZD License Plate MgLz Licenseplate E90T 00Hp Bounding Boxes
    Description

    Enbor Eda Article 2 V4

    ## Overview
    
    Enbor Eda Article 2 V4 is a dataset for object detection tasks - it contains Plate AQOx Plate XoZD License Plate MgLz Licenseplate E90T 00Hp annotations for 2,125 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  11. ‘US Health Insurance Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 15, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘US Health Insurance Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-us-health-insurance-dataset-8b56/068994aa/?iid=012-655&v=presentation
    Explore at:
    Dataset updated
    Nov 15, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘US Health Insurance Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/teertha/ushealthinsurancedataset on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    The venerable insurance industry is no stranger to data-driven decision making. Yet in today's rapidly transforming digital landscape, insurance is struggling to adapt to and benefit from new technologies compared to other industries, even within the BFSI sphere (compared to the banking sector, for example). Extremely complex underwriting rule-sets that differ radically across product lines, many non-KYC environments lacking a centralized customer information base, complex relationships with consumers in traditional risk underwriting (where customer centricity sometimes runs counter to business profit), and the inertia of regulatory compliance are some of the unique challenges faced by the insurance business.

    Despite this, emergent technologies like AI and Block Chain have brought a radical change in Insurance, and Data Analytics sits at the core of this transformation. We can identify 4 key factors behind the emergence of Analytics as a crucial part of InsurTech:

    • Big Data: the explosion of unstructured data in the form of images, videos, text, emails, and social media
    • AI: recent advances in machine learning and deep learning that enable businesses to gain insight, do predictive analytics, and build cost- and time-efficient innovative solutions
    • Real-time processing: the ability to process information in real time through various data feeds (e.g., social media, news)
    • Increased computing power: a complex ecosystem of new analytics vendors and solutions that enables carriers to combine data sources, external insights, and advanced modeling techniques to glean insights that were not possible before

    This dataset can be helpful in a simple yet illuminating study in understanding the risk underwriting in Health Insurance, the interplay of various attributes of the insured and see how they affect the insurance premium.

    Content

    This dataset contains 1338 rows of insured data, where the Insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region. There are no missing or undefined values in the dataset.

    Inspiration

    This relatively simple dataset should be an excellent starting point for EDA, Statistical Analysis and Hypothesis testing and training Linear Regression models for predicting Insurance Premium Charges.

    Proposed tasks:

    • Exploratory data analytics
    • Statistical hypothesis testing
    • Statistical modeling
    • Linear regression

    --- Original source retains full ownership of the source dataset ---
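    As a sketch of the proposed linear-regression task, ordinary least squares can be fit in closed form. The handful of rows below is invented for illustration and is not drawn from the dataset:

```python
import numpy as np

# Invented toy rows (not from the dataset): age, BMI, smoker (1/0) -> charges.
X = np.array([
    [19, 27.9, 1],
    [33, 22.7, 0],
    [28, 33.0, 0],
    [45, 25.7, 1],
    [52, 30.9, 0],
    [23, 34.4, 0],
], dtype=float)
y = np.array([16000.0, 4400.0, 4600.0, 22000.0, 10600.0, 1800.0])

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef                      # fitted charges for the training rows
print(coef.shape)  # → (4,)
```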

  12. Understanding Fatigue Through Biosignals: A Comprehensive Dataset

    • zenodo.org
    Updated Mar 26, 2024
    Cite
    Marta Gabbi; Luca Cornia; Valeria Villani; Lorenzo Sabattini (2024). Understanding Fatigue Through Biosignals: A Comprehensive Dataset [Dataset]. http://doi.org/10.5281/zenodo.8423405
    Explore at:
    Dataset updated
    Mar 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marta Gabbi; Luca Cornia; Valeria Villani; Lorenzo Sabattini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Fatigue is a multifaceted construct that represents an important part of human experience. Its two main aspects, mental and physical, often intertwine, intensifying their collective impact on daily life and overall well-being.
    To soften this impact, understanding and quantifying fatigue is crucial. Physiological data play a pivotal role in the comprehension of fatigue, offering precious insight into the level and type of fatigue experienced.

    The MePhy dataset includes physiological data gathered while inducing different types of fatigue conditions, in particular mental and physical fatigue. We collected various biosignals closely associated with fatigue (ECG, EDA, EMG and Eye Blinking). Test participants endured a four-part experiment that aimed to elicit mental fatigue, physical fatigue and a combination of both.

    The main folder contains:

    • MePhy Dataset folder, which contains the dataset;
    • ReadMe.pdf, which provides more information about the dataset;
    • Mental Fatigue Inducing Test folder, which includes the HTML application used to simulate mental fatigue in the test participants.

    A more in-depth description of the MePhy dataset can be found in the following paper: https://doi.org/10.1145/3610977.3637485.

    Marta Gabbi, Luca Cornia, Valeria Villani, and Lorenzo Sabattini (2024) Understanding Fatigue Through Biosignals: A Comprehensive Dataset. In Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction (HRI ’24).

  13. ‘Groceries dataset for Market Basket Analysis(MBA)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 4, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Groceries dataset for Market Basket Analysis(MBA)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-groceries-dataset-for-market-basket-analysis-mba-d4c7/a0d6998a/?iid=009-334&v=presentation
    Explore at:
    Dataset updated
    Aug 4, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Groceries dataset for Market Basket Analysis(MBA)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rashikrahmanpritom/groceries-dataset-for-market-basket-analysismba on 13 November 2021.

    --- Dataset description provided by original source is as follows ---

    The initial data were collected from the Groceries dataset, then modified and fragmented into two datasets for ease of MBA implementation. Here, "groceries data.csv" contains groceries transaction data on which you can do EDA, and which you can pre-process to feed into the apriori algorithm. I have also added pre-processed data as "basket.csv": you need only replace the NaN values and encode the data using TransactionEncoder, after which you can feed the encoded data into the apriori algorithm.

    --- Original source retains full ownership of the source dataset ---
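    The apriori step described, encoding transactions and mining frequent itemsets, reduces at its core to counting itemset support. Below is a stdlib-only sketch of that core (the TransactionEncoder mentioned is from mlxtend and is not reproduced here):

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.5, max_size=2):
    """Return {itemset: support} for itemsets appearing in at least
    min_support of the transactions, up to max_size items per set."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

# Invented toy baskets for illustration:
baskets = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "jam"],
    ["milk", "butter"],
]
freq = frequent_itemsets(baskets)
print(freq[("bread", "milk")])  # → 0.5
```

Unlike this exhaustive sketch, apriori prunes the search space by only extending itemsets whose subsets are already frequent.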

  14. Surface Water Stations - MPCA Environmental Data Access

    • gisdata.mn.gov
    • data.wu.ac.at
    fgdb, gpkg, html +2
    Updated Aug 30, 2025
    Cite
    Pollution Control Agency (2025). Surface Water Stations - MPCA Environmental Data Access [Dataset]. https://gisdata.mn.gov/dataset/env-eda-surfacewater-stations
    Explore at:
    gpkg, html, jpeg, fgdb, shp (available download formats)
    Dataset updated
    Aug 30, 2025
    Dataset provided by
    Minnesota Pollution Control Agency
    Description

    Minnesota Pollution Control Agency (MPCA) surface water monitoring station locations, including lake, stream, biological, and discharge stations. Locations of United States Geological Survey (USGS) stream flow stations are also included. This data set was created as part of MPCA's Environmental Data Access project, which was designed to provide internet access to MPCA's surface water monitoring data. The data set contains locational data and limited attributes for all MPCA stream chemistry stations, MPCA lake monitoring stations, MPCA stream biology stations, MPCA permitted dischargers [National Pollutant Discharge Elimination System (NPDES) permits], and the locations (only) of USGS stream flow stations. MPCA lake and stream monitoring stations are the same stations found in MPCA's EQuIS database.

  15. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    csv, text/markdown, json, bin (available download formats)
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
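    For modeling, the two CompetitionOpenSince fields are often combined into a single "months of competition" feature. A minimal sketch with made-up values (the field names follow the description above):

    ```python
    import pandas as pd

    # Hypothetical row values for one store; the real data comes from store.csv.
    row = {"CompetitionOpenSinceYear": 2012, "CompetitionOpenSinceMonth": 9}
    date = pd.Timestamp("2015-07-31")  # the sales date being featurized

    # Number of whole months the competitor has been open as of `date`.
    months_open = (date.year - row["CompetitionOpenSinceYear"]) * 12 \
        + (date.month - row["CompetitionOpenSinceMonth"])
    print(months_open)  # → 34
    ```

    The same year/week arithmetic applies to the Promo2Since fields.
    
    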

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.
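    As a minimal illustration of how these libraries fit together, the sketch below joins store metadata onto the daily sales records and drops closed days. Tiny inline frames stand in for the real train and store CSVs; the column names follow the field description above, the values are invented:

    ```python
    import pandas as pd

    # Stand-ins for train.csv and store.csv (values are illustrative only).
    train = pd.DataFrame({
        "Store": [1, 1, 2],
        "Date": pd.to_datetime(["2015-07-01", "2015-07-02", "2015-07-01"]),
        "Sales": [5263, 6064, 8314],
        "Open": [1, 1, 0],
    })
    store = pd.DataFrame({
        "Store": [1, 2],
        "StoreType": ["c", "a"],
        "CompetitionDistance": [1270.0, 570.0],
    })

    # Attach per-store metadata to every daily record on the Store key,
    # then keep only days the store was actually open (closed days have
    # Sales == 0 by definition and would distort the error metrics).
    df = train.merge(store, on="Store", how="left")
    df = df[df["Open"] == 1]
    print(df[["Store", "Sales", "StoreType"]])
    ```
    
    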

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
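    For reference, the two error metrics reported in the model evaluation can be computed as follows; the values below are toy numbers, not results from the project:

    ```python
    import numpy as np

    # Toy actual and predicted daily sales.
    y_true = np.array([100.0, 200.0, 400.0])
    y_pred = np.array([110.0, 180.0, 400.0])

    # Mean Absolute Percentage Error, in percent.
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    # Root Mean Squared Error, in the units of Sales.
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    print(round(mape, 2), round(rmse, 2))  # → 6.67 12.91
    ```
    
    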

  16. Cognitive Fatigue

    • figshare.com
    csv
    Updated Jun 4, 2025
    Cite
    Rui Varandas; Inês Silveira; Hugo Gamboa (2025). Cognitive Fatigue [Dataset]. http://doi.org/10.6084/m9.figshare.28188143.v3
    Explore at:
    csvAvailable download formats
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    figshare
    Authors
    Rui Varandas; Inês Silveira; Hugo Gamboa
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Cognitive Fatigue

    2.1. Experimental design
    Cognitive fatigue (CF) is a phenomenon that arises following prolonged engagement in mentally demanding cognitive tasks. We therefore developed an experimental procedure involving three demanding tasks: a digital lesson in Jupyter Notebook format, three repetitions of the Corsi-Block task, and two repetitions of a concentration test. Before the Corsi-Block task and after the concentration task there were two-minute baseline periods. In our analysis, the first baseline period, although not explicitly present in the dataset, was designated as representing no CF, whereas the final baseline period was designated as representing the presence of CF. Between repetitions of the Corsi-Block task there were baseline periods of 15 s after the task and of 30 s before the beginning of each repetition.

    2.2. Data recording
    A data sample of 10 volunteer participants (4 females) aged between 22 and 48 years (M = 28.2, SD = 7.6) took part in this study. All volunteers were recruited at NOVA School of Science and Technology, were fluent in English and right-handed, and none reported suffering from psychological disorders or taking regular medication. Written informed consent was obtained before participation, and all ethical procedures approved by the Ethics Committee of NOVA University of Lisbon were thoroughly followed. The data from one participant were omitted due to insufficient duration of data acquisition.

    2.3. Data labelling
    The labels easy, difficult, very difficult and repeat found in the ECG_lesson_answers.txt files represent the subjects' opinion of the content read in the ECG lesson. The repeat label represents the most difficult level: pressing it shows the answer to the question again. This system is based on the Anki system, which has been proposed and used to memorise information effectively. In addition, the PB description JSON files include timestamps indicating the start and end of cognitive tasks, baseline periods, and other events, which are useful for defining CF states as described in 2.1.

    2.4. Data description
    Biosignals include EEG, fNIRS (not converted to oxy- and deoxyHb), ECG, EDA, respiration (RIP), accelerometer (ACC), and push-button (PB) data. All signals have already been converted to physical units. In each biosignal file, the first column corresponds to the timestamps. HCI features encompass keyboard, mouse, and screenshot data. Below is a Python snippet for extracting screenshot files from the screenshots CSV file (the original listing has been reindented, and the directory is now created once before the loop rather than on every iteration, which would raise FileExistsError):

        import base64
        import os

        file = '...'  # path to the screenshots CSV file
        with open(file, 'r') as f:
            lines = f.readlines()

        os.makedirs('screenshot', exist_ok=True)  # create the output folder once
        for line in lines[1:]:
            timestamp = line.split(',')[0]
            code = line.split(',')[-1][:-2]       # drop the trailing newline characters
            imgdata = base64.b64decode(code)
            filename = str(timestamp) + '.jpeg'
            with open(os.path.join('screenshot', filename), 'wb') as out:
                out.write(imgdata)

    A characterization file containing age and gender information for all subjects in each dataset is provided within the respective dataset folder (e.g., D2_subject-info.csv). Other complementary files include (i) a description of the push-buttons to help segment the signals (e.g., D2_S2_PB_description.json) and (ii) labelling (e.g., D2_S2_ECG_lesson_results.txt). The files D2_Sx_results_corsi-block_board_1.json and D2_Sx_results_corsi-block_board_2.json show the results for the first and second iterations of the Corsi-Block task, where, for example, row_0_1 = 12 means that the subject got 12 pairs right in the first row of the first board, and row_0_2 = 12 means the same for the second board.
  17. Replication Package of Deep Learning and Data Augmentation for Detecting...

    • zenodo.org
    zip
    Updated Apr 24, 2024
    Cite
    Anonymous Anonymous (2024). Replication Package of Deep Learning and Data Augmentation for Detecting Self-Admitted Technical Debt [Dataset]. http://doi.org/10.5281/zenodo.10521909
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 24, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 17, 2024
    Description

    Self-Admitted Technical Debt (SATD) refers to circumstances where developers use code comments, issues, pull requests, or other textual artifacts to explain why the existing implementation is not optimal. Past research in detecting SATD has focused on either identifying SATD (classifying SATD instances as SATD or not) or categorizing SATD (labeling instances as SATD that pertain to requirements, design, code, test, etc.). However, the performance of such approaches remains suboptimal, particularly when dealing with specific types of SATD, such as test and requirement debt. This is mostly because the used datasets are extremely imbalanced.

    In this study, we utilize a data augmentation strategy to address the problem of imbalanced data. We also employ a two-step approach to identify and categorize SATD on various datasets derived from different artifacts. Based on earlier research, a deep learning architecture called BiLSTM is utilized for the binary identification of SATD. The BERT architecture is then utilized to categorize different types of SATD. We provide the dataset of balanced classes as a contribution for future SATD researchers, and we also show that the performance of SATD identification and categorization using deep learning and our two-step approach is significantly better than baseline approaches.
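    The two-step flow can be illustrated schematically. In the sketch below, simple keyword rules are placeholders standing in for the BiLSTM identifier and the BERT categorizer; they are not the authors' models, only an illustration of how the two stages compose:

    ```python
    # Step 1 placeholder: binary SATD identification (stands in for BiLSTM).
    def identify_satd(text: str) -> bool:
        return "todo" in text.lower() or "hack" in text.lower()

    # Step 2 placeholder: SATD type categorization (stands in for BERT).
    def categorize_satd(text: str) -> str:
        return "TES" if "test" in text.lower() else "C/D"

    def two_step(texts):
        # Only texts flagged by step 1 are passed to the step-2 categorizer;
        # everything else is labeled Not-SATD.
        return [categorize_satd(t) if identify_satd(t) else "Not-SATD"
                for t in texts]

    print(two_step(["TODO: fix flaky test", "compute checksum"]))
    # → ['TES', 'Not-SATD']
    ```
    
    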

    Therefore, to showcase the effectiveness of our approach, we compared it against several existing approaches:

    1. Natural Language Processing (NLP) and Matches task Annotation Tags (MAT) [Github]
    2. eXtreme Gradient Boosting+Synthetic Minority Oversampling Technique (XGBoost+SMOTE) [Figshare]
    3. eXtreme Gradient Boosting+Easy Data Augmentation (XGBoost+EDA) [Github]
    4. MT-Text-CNN [Github]

    Structure of the Replication Package:

    In accordance with the original dataset, this dataset comprises four distinct CSV files, one per artifact considered in this study. Each CSV file contains a text column and a class column, the latter denoting the SATD type: code/design debt (C/D), documentation debt (DOC), test debt (TES), requirement debt (REQ), or Not-SATD.

    ├── SATD Keywords
    │ ├── Keywords based on Source of Artifacts
    │ │ ├── Code comment.txt
    │ │ ├── Commit message.txt
    │ │ ├── Issue section.txt
    │ │ └── Pull section.txt
    │ ├── Keywords based on Types of SATD
    │ │ ├── code-design debt.txt
    │ │ ├── documentation debt.txt
    │ │ ├── requirement debt.txt
    │ │ └── test debt.txt
    ├── src
    │ ├── bert.py
    │ ├── bilstm.py
    │ └── preprocessing.py
    ├── data-augmentation-code_comments.csv
    ├── data-augmentation-commit_messages.csv
    ├── data-augmentation-issues.csv
    ├── data-augmentation-pull_requests.csv
    └── Supplementary Material.docx

    Requirements:

    nltk
    transformers
    torch
    tensorflow
    keras
    langdetect
    inflect
    inflection
    Project sources for each artifact are as follows:
    Source code comment | Issue section | Pull section | Commit message (the project lists below follow in this order)
    ant
    argouml
    columba
    emf
    hibernate
    jedit
    jfreechart
    jmeter
    jruby
    squirrel
    camel
    chromium
    gerrit
    hadoop
    hbase
    impala
    thrift
    accumulo
    activemq
    activemq-artemis
    airflow
    ambari
    apisix
    apisix-dashboard
    arrow
    attic-apex-core
    attic-apex-malhar
    attic-stratos
    avro
    beam
    bigtop
    bookkeeper
    brooklyn-server
    calcite
    camel
    camel-k
    camel-quarkus
    camel-website
    carbondata
    cassandra
    cloudstack
    commons-lang
    couchdb
    cxf
    daffodil
    drill
    druid
    dubbo
    echarts
    fineract
    flink
    fluo
    geode
    geode-native
    gobblin
    griffin
    groovy
    guacamole-client
    hadoop
    hawq
    hbase
    helix
    hive
    hudi
    iceberg
    ignite
    incubator-brooklyn
    incubator-dolphinscheduler
    incubator-doris
    incubator-heron
    incubator-hop
    incubator-mxnet
    incubator-pagespeed-ngx
    incubator-pinot
    incubator-weex
    infrastructure-puppet
    jena
    jmeter
    kafka
    karaf
    kylin
    lucene-solr
    madlib
    myfaces-tobago
    netbeans
    netbeans-website
    nifi
    nifi-minifi-cpp
    nutch
    openwhisk
    openwhisk-wskdeploy
    orc
    ozone
    parquet-mr
    phoenix
    pulsar
    qpid-dispatch
    reef
    rocketmq
    samza
    servicecomb-java-chassis
    shardingsphere
    shardingsphere-elasticjob
    skywalking
    spark
    storm
    streams
    superset
    systemds
    tajo
    thrift
    tinkerpop
    tomee
    trafficcontrol
    trafficserver
    trafodion
    tvm
    usergrid
    zeppelin
    zookeeper
    accumulo
    activemq
    activemq-artemis
    airflow
    ambari
    apisix
    apisix-dashboard
    arrow
    attic-apex-core
    attic-apex-malhar
    attic-stratos
    avro
    beam
    bigtop
    bookkeeper
    brooklyn-server
    calcite
    camel
    camel-k
    camel-quarkus
    camel-website
    carbondata
    cassandra
    cloudstack
    commons-lang
    couchdb
    cxf
    daffodil
    drill
    druid
    dubbo
    echarts
    fineract
    flink
    fluo
    geode
    geode-native
    gobblin
    griffin
    groovy
    guacamole-client
    hadoop
    hawq
    hbase
    helix
    hive
    hudi
    iceberg
    ignite
    incubator-brooklyn
    incubator-dolphinscheduler
    incubator-doris
    incubator-heron
    incubator-hop
    incubator-mxnet
    incubator-pagespeed-ngx
    incubator-pinot
    incubator-weex
    infrastructure-puppet
    jena
    jmeter
    kafka
    karaf
    kylin
    lucene-solr
    madlib
    myfaces-tobago
    netbeans
    netbeans-website
    nifi
    nifi-minifi-cpp
    nutch
    openwhisk
    openwhisk-wskdeploy
    orc
    ozone
    parquet-mr
    phoenix
    pulsar
    qpid-dispatch
    reef
    rocketmq
    samza
    servicecomb-java-chassis
    shardingsphere
    shardingsphere-elasticjob
    skywalking
    spark
    storm
    streams
    superset
    systemds
    tajo
    thrift
    tinkerpop
    tomee
    trafficcontrol
    trafficserver
    trafodion
    tvm
    usergrid
    zeppelin
    zookeeper

    This dataset has undergone a data augmentation process using the AugGPT technique. Meanwhile, the original dataset can be downloaded via the following link: https://github.com/yikun-li/satd-different-sources-data

  18. G-REx: A Real-World Dataset of Group Emotion Experiences based on...

    • zenodo.org
    Updated Jul 12, 2023
    Cite
    Patricia Bota; Joana Brito; Ana Fred; Pablo Cesar; Hugo Plácido Silva (2023). G-REx: A Real-World Dataset of Group Emotion Experiences based on Physiological Data [Dataset]. http://doi.org/10.5281/zenodo.8136135
    Explore at:
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Patricia Bota; Joana Brito; Ana Fred; Pablo Cesar; Hugo Plácido Silva
    Description

    G-REX is a novel dataset for real-world affective computing, with data collected in a naturalistic setting during movie sessions. Group physiological data (Photoplethysmography – PPG and Electrodermal Activity – EDA) are collected using a wrist-worn unobtrusive device and emotion ground-truth annotation is performed retrospectively, based on selected segments where each subject showed an elevated physiological response. In total, we provide data over 14 movie sessions, making a total of more than 300 hours of physiological data collected over 100+ subjects in groups.

  19. ‘IMDB Movies Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 13, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘IMDB Movies Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-imdb-movies-dataset-f301/9b433bd2/?iid=018-445&v=presentation
    Explore at:
    Dataset updated
    Nov 13, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘IMDB Movies Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows on 13 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    IMDB Dataset of top 1000 movies and tv shows. You can find the EDA Process on - https://www.kaggle.com/harshitshankhdhar/eda-on-imdb-movies-dataset

    Please consider UPVOTE if you found it useful.

    Content

    Data fields:

    • Poster_Link - link of the poster that IMDB is using
    • Series_Title - name of the movie
    • Released_Year - year in which the movie was released
    • Certificate - certificate earned by the movie
    • Runtime - total runtime of the movie
    • Genre - genre of the movie
    • IMDB_Rating - rating of the movie on the IMDB site
    • Overview - mini story/summary
    • Meta_score - score earned by the movie
    • Director - name of the director
    • Star1, Star2, Star3, Star4 - names of the stars
    • No_of_votes - total number of votes
    • Gross - money earned by the movie
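    The Runtime and Gross columns usually need numeric conversion before any analysis. Assuming the string formats commonly seen in this Kaggle dataset ("142 min" for Runtime, comma-separated figures for Gross), a sketch:

    ```python
    import pandas as pd

    # Illustrative rows in the assumed string formats of the source CSV.
    df = pd.DataFrame({
        "Runtime": ["142 min", "175 min"],
        "Gross": ["28,341,469", "134,966,411"],
    })

    # Strip the unit suffix and the thousands separators, then cast.
    df["Runtime_min"] = df["Runtime"].str.replace(" min", "").astype(int)
    df["Gross_num"] = df["Gross"].str.replace(",", "").astype(float)
    print(df[["Runtime_min", "Gross_num"]])
    ```
    
    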

    Inspiration

    • Analysis of a movie's gross vs. its director.
    • Analysis of a movie's gross vs. its various stars.
    • Analysis of a movie's No_of_votes vs. its director.
    • Analysis of a movie's No_of_votes vs. its various stars.
    • Which genres does each actor prefer?
    • Which combinations of actors most often achieve a good IMDB_Rating?
    • Which combinations of actors achieve a good gross?

    --- Original source retains full ownership of the source dataset ---

  20. Impact of AI in Education Processes

    • dataverse.tdl.org
    Updated Feb 20, 2024
    Cite
    Saksham Adhikari (2024). Impact of AI in Education Processes [Dataset]. http://doi.org/10.18738/T8/RXUCHK
    Explore at:
    application/x-ipynb+json(428065), pptx(80640), tsv(7079)Available download formats
    Dataset updated
    Feb 20, 2024
    Dataset provided by
    Texas Data Repository
    Authors
    Saksham Adhikari
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We performed data analysis on an open dataset containing survey responses about how useful students find AI in the educational process. We cleaned and preprocessed the data, carried out an exploratory data analysis (EDA), visualized the results and our findings, and interpreted them in our digital poster.
