The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across industries. The market, estimated at $1.5 billion in 2025, is projected to grow at a compound annual growth rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033. This expansion is fueled by several key factors. First, the rising adoption of big data analytics and business intelligence initiatives across large enterprises and SMEs is creating significant demand for efficient EDA tools. Second, the growing need for faster, more insightful data analysis to support decision-making is driving a preference for user-friendly graphical EDA tools over traditional non-graphical methods. In addition, advances in artificial intelligence and machine learning are being integrated into EDA tools, enhancing their capabilities and broadening their appeal.

Market segmentation shows a significant share held by large enterprises, reflecting their greater resources and data handling needs. The SME segment, however, is rapidly gaining traction, driven by the increasing affordability and accessibility of cloud-based EDA solutions. Geographically, North America currently dominates the market, but regions such as Asia-Pacific show high growth potential due to increasing digitalization and technological advancement.

Despite this positive outlook, certain restraints remain. The high initial investment required to implement advanced EDA solutions can be a barrier for some SMEs, and the need for skilled professionals to use these tools effectively can challenge organizations. The ongoing development of user-friendly interfaces and the availability of training resources are actively mitigating these limitations. The competitive landscape is characterized by a mix of established players such as IBM and emerging companies offering specialized solutions. Continued innovation in areas such as automated data preparation and advanced visualization will further shape the EDA tools market and sustain its growth trajectory.
Exploratory Data Analysis (EDA) tools play a pivotal role in the modern data-driven landscape, transforming raw data into actionable insights. As businesses increasingly recognize the value of data in informing decisions, the market for EDA tools has witnessed substantial growth, driven by the rapid expansion of data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support arguments that inform the Systematic Literature Review (SLR) of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships within the Deep Learning literature reported in Software Engineering. These hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes, which can be found in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract information found to be missing after the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes supports a better understanding and classification of the papers by the data mining tasks.
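As an illustration of this normalization, a minimal sketch using pandas is shown below; the keyword map, column names, and example values are illustrative rather than the exact repository schema.

```python
import pandas as pd

# Keyword-to-class map for normalizing free-text metric names into the
# nominal classes listed above; keywords here are illustrative.
KEYWORD_TO_CLASS = {
    "mrr": "MRR",
    "roc": "ROC or AUC",
    "auc": "ROC or AUC",
    "bleu": "BLEU Score",
    "accuracy": "Accuracy",
    "precision": "Precision",
    "recall": "Recall",
    "f1": "F1 Measure",
}

def normalize_metric(raw: str) -> str:
    """Map a raw metric string onto one of the canonical nominal classes."""
    raw_lower = raw.strip().lower()
    for keyword, canonical in KEYWORD_TO_CLASS.items():
        if keyword in raw_lower:
            return canonical
    return "Other Metrics"  # unconventional metrics found during extraction

papers = pd.DataFrame({"metrics": ["Top-10 Accuracy", "AUC", "BLEU-4", "perplexity"]})
papers["metrics_nominal"] = papers["metrics"].map(normalize_metric).astype("category")
print(papers)
```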
Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models.
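A minimal sketch of this transformation step is shown below, assuming the 35 engineered features have already been encoded into a numeric matrix; the actual study used its own tooling, so treat this as illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((128, 35))  # stand-in for the encoded DL4SE feature matrix

# Project the 35 features onto 2 principal components for visualization.
coords = PCA(n_components=2).fit_transform(X)
print("2-D coordinates of the first three papers:\n", np.round(coords[:3], 3))

# Inspect how within-cluster variance drops as k grows (elbow heuristic),
# to pick the number of clusters used when tuning the explainable models.
for k in range(2, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}: within-cluster sum of squares = {inertia:.1f}")
```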
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (Correlations and Association Rules) and categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by reasoning over the data mining outcomes, and this reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles represent both Premises and Conclusions. An arrow connecting a Premise with a Conclusion indicates that, given the premise, the conclusion is associated with it. For example, given that an author used Supervised Learning, we can conclude, with a certain Support and Confidence, that their approach is irreproducible.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement, divided by the number of occurrences of the premise.
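A small, self-contained illustration of these two definitions on hypothetical boolean features follows; the real analysis was run in RapidMiner on the full feature set.

```python
import pandas as pd

# Hypothetical paper-level features: premise and conclusion of one rule.
papers = pd.DataFrame({
    "supervised_learning": [True, True, True, False, True],
    "irreproducible":      [True, True, False, False, True],
})

premise = papers["supervised_learning"]
conclusion = papers["irreproducible"]
both = premise & conclusion

support = both.sum() / len(papers)       # statements where the rule holds / all statements
confidence = both.sum() / premise.sum()  # support of the rule relative to the premise
print(f"support={support:.2f}, confidence={confidence:.2f}")
```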
The global market for data lens (visualizations of data) is experiencing robust growth, driven by the increasing adoption of data analytics across diverse industries. This market, estimated at $50 billion in 2025, is projected to achieve a compound annual growth rate (CAGR) of 15% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the rising volume and complexity of data necessitate effective visualization tools for insightful analysis. Businesses are increasingly relying on interactive dashboards and data storytelling techniques to derive actionable intelligence from their data, fostering the demand for sophisticated data visualization solutions. Secondly, advancements in artificial intelligence (AI) and machine learning (ML) are enhancing the capabilities of data visualization platforms, enabling automated insights generation and predictive analytics. This creates new opportunities for vendors to offer more advanced and user-friendly tools. Finally, the growing adoption of cloud-based solutions is further accelerating market growth, offering enhanced scalability, accessibility, and cost-effectiveness.

The market is segmented across various types, including points, lines, and bars, and applications, ranging from exploratory data analysis and interactive data visualization to descriptive statistics and advanced data science techniques. Major players like Tableau, Sisense, and Microsoft dominate the market, constantly innovating to meet evolving customer needs and competitive pressures.

The geographical distribution of the market reveals strong growth across North America and Europe, driven by early adoption and technological advancements. However, emerging markets in Asia-Pacific and the Middle East & Africa are showing significant growth potential, fueled by increasing digitalization and investment in data analytics infrastructure. Restraints to growth include the high cost of implementation, the need for skilled professionals to effectively utilize these tools, and security concerns related to data privacy. Nonetheless, the overall market outlook remains positive, with continued expansion anticipated throughout the forecast period due to the fundamental importance of data visualization in informed decision-making across all sectors.
Presentation Date: Sunday, January 8th, 2023
Location: Seattle, Washington, USA
Abstract: A talk introducing glue software and its role in astronomy, given at the 2023 AAS meeting. Files included are Keynote slides (in .key and .pdf formats).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exploratory data analysis.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
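As a rough sketch of this kind of comparison, the code below computes several row-wise summary statistics next to PCA scores on synthetic spectra; the PRE formula used here (Shannon entropy of each normalized spectrum) is an assumption and may differ in detail from the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
spectra = rng.random((50, 200))  # 50 hypothetical spectra, 200 channels each

def pre(row: np.ndarray) -> float:
    """Shannon entropy of a spectrum normalized to unit sum (assumed PRE definition)."""
    p = np.abs(row) / np.abs(row).sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

summaries = {
    "PRE":    np.apply_along_axis(pre, 1, spectra),
    "mean":   spectra.mean(axis=1),
    "STD":    spectra.std(axis=1),
    "1-norm": np.abs(spectra).sum(axis=1),
    "range":  spectra.max(axis=1) - spectra.min(axis=1),
    "SSQ":    (spectra ** 2).sum(axis=1),
}

# Factor-based comparison point: scores on the first two principal components.
scores = PCA(n_components=2).fit_transform(spectra)
print("PC scores of the first three spectra:\n", np.round(scores[:3], 3))
for name, values in summaries.items():
    print(f"{name:>6}: first five values -> {np.round(values[:5], 3)}")
```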
Exploratory Data Analysis for the Physical Properties of Lakes
This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the first part of a two-part exercise focusing on the physical properties of lakes.
Introduction
Lakes are dynamic, nonuniform bodies of water in which the physical, biological, and chemical properties interact. Lakes also contain the majority of Earth's fresh water supply. This lesson introduces exploratory data analysis using R statistical software in the context of the physical properties of lakes.
Learning Objectives
After successfully completing this exercise, you will be able to:
This dataset supports a study examining how students perceive the usefulness of artificial intelligence (AI) in educational settings. The project involved analyzing an open-access survey dataset that captures a wide range of student responses on AI tools in learning.
The data underwent cleaning and preprocessing, followed by an exploratory data analysis (EDA) to identify key trends and insights. Visualizations were created to support interpretation, and the results were summarized in a digital poster format to communicate findings effectively.
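As a rough illustration of that workflow, the sketch below summarizes one hypothetical Likert-style question and plots it; the file name and column name are assumptions, since the dataset's actual schema is not reproduced here.

```python
import pandas as pd
import matplotlib.pyplot as plt

responses = pd.read_csv("ai_in_education_survey.csv")  # hypothetical filename

# Basic cleaning: drop empty rows and standardize one Likert-style column.
responses = responses.dropna(how="all")
likert_order = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
col = "ai_tools_are_useful"  # assumed column name
responses[col] = pd.Categorical(
    responses[col].str.strip(), categories=likert_order, ordered=True
)

# Exploratory summary and a simple visualization for the poster.
print(responses[col].value_counts(sort=False))
responses[col].value_counts(sort=False).plot(
    kind="bar", title="Perceived usefulness of AI tools"
)
plt.tight_layout()
plt.show()
```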
This resource may be useful for researchers, educators, and technologists interested in the evolving role of AI in education.
Keywords: Artificial Intelligence, Education, Student Perception, Survey, Data Analysis, EDA
Subject: Computer and Information Science
License: CC0 1.0 Universal Public Domain Dedication
DOI: https://doi.org/10.18738/T8/RXUCHK
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises over 20 years of geotechnical laboratory testing data collected primarily from Vienna, Lower Austria, and Burgenland. It includes 24 features documenting critical soil properties derived from particle size distributions, Atterberg limits, Proctor tests, permeability tests, and direct shear tests. Locations for a subset of samples are provided, enabling spatial analysis.
The dataset is a valuable resource for geotechnical research and education, allowing users to explore correlations among soil parameters and develop predictive models. Examples of such correlations include liquidity index with undrained shear strength, particle size distribution with friction angle, and liquid limit and plasticity index with residual friction angle.
Python-based exploratory data analysis and machine learning applications have demonstrated the dataset's potential for predictive modeling, achieving moderate accuracy for parameters such as cohesion and friction angle. Its temporal and spatial breadth, combined with repeated testing, enhances its reliability and applicability for benchmarking and validating analytical and computational geotechnical methods.
This dataset is intended for researchers, educators, and practitioners in geotechnical engineering. Potential use cases include refining empirical correlations, training machine learning models, and advancing soil mechanics understanding. Users should note that preprocessing steps, such as imputation for missing values and outlier detection, may be necessary for specific applications.
Key Features:
Temporal Coverage: Over 20 years of data.
Geographical Coverage: Vienna, Lower Austria, and Burgenland.
Tests Included:
Particle Size Distribution
Atterberg Limits
Proctor Tests
Permeability Tests
Direct Shear Tests
Number of Variables: 24
Potential Applications: Correlation analysis, predictive modeling, and geotechnical design.
Technical Details:
Missing values have been addressed using K-Nearest Neighbors (KNN) imputation, and anomalies have been identified using Local Outlier Factor (LOF) methods in previous studies (a minimal sketch follows this list).
Data normalization and standardization steps are recommended for specific analyses.
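A minimal preprocessing sketch along those lines is given below; the file name, column selection, and parameter values are assumptions, not the settings used in the cited studies.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

soil = pd.read_csv("geotechnical_lab_tests.csv")  # hypothetical filename
numeric = soil.select_dtypes("number")

# Impute missing laboratory values from the k nearest samples in feature space.
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=5).fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)

# Standardize before outlier detection, then flag anomalous samples with LOF.
scaled = StandardScaler().fit_transform(imputed)
soil["is_outlier"] = LocalOutlierFactor(n_neighbors=20).fit_predict(scaled) == -1
print(soil["is_outlier"].value_counts())
```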
Acknowledgments: The dataset was compiled with support from the European Union's MSCA Staff Exchanges project 101182689 Geotechnical Resilience through Intelligent Design (GRID).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The high-resolution and mass accuracy of Fourier transform mass spectrometry (FT-MS) has made it an increasingly popular technique for discerning the composition of soil, plant and aquatic samples containing complex mixtures of proteins, carbohydrates, lipids, lignins, hydrocarbons, phytochemicals and other compounds. Thus, there is a growing demand for informatics tools to analyze FT-MS data that will aid investigators seeking to understand the availability of carbon compounds to biotic and abiotic oxidation and to compare fundamental chemical properties of complex samples across groups. We present ftmsRanalysis, an R package which provides an extensive collection of data formatting and processing, filtering, visualization, and sample and group comparison functionalities. The package provides a suite of plotting methods and enables expedient, flexible and interactive visualization of complex datasets through functions which link to a powerful and interactive visualization user interface, Trelliscope. Example analysis using FT-MS data from a soil microbiology study demonstrates the core functionality of the package and highlights the capabilities for producing interactive visualizations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sensor technologies allow ethologists to continuously monitor the behaviors of large numbers of animals over extended periods of time. This creates new opportunities to study livestock behavior in commercial settings, but also new methodological challenges. Densely sampled behavioral data from large heterogeneous groups can contain a range of complex patterns and stochastic structures that may be difficult to visualize using conventional exploratory data analysis techniques. The goal of this research was to assess the efficacy of unsupervised machine learning tools in recovering complex behavioral patterns from such datasets to better inform subsequent statistical modeling. This methodological case study was carried out using records on milking order, or the sequence in which cows arrange themselves as they enter the milking parlor. Data was collected over a 6-month period from a closed group of 200 mixed-parity Holstein cattle on an organic dairy. Cows at the front and rear of the queue proved more consistent in their entry position than animals at the center of the queue, a systematic pattern of heterogeneity more clearly visualized using entropy estimates, a scale- and distribution-free alternative to variance that is robust to outliers. Dimension reduction techniques were then used to visualize relationships between cows. No evidence of social cohesion was recovered, but Diffusion Map embeddings proved more adept than PCA at revealing the underlying linear geometry of this data. Median parlor entry positions from the pre- and post-pasture subperiods were highly correlated (R = 0.91), suggesting a surprising degree of temporal stationarity. Data Mechanics visualizations, however, revealed heterogeneous non-stationarity among subgroups of animals in the center of the group and herd-level temporal outliers. A repeated measures model recovered inconsistent evidence of a relationship between entry position and cow attributes. Mutual conditional entropy tests, a permutation-based approach to assessing bivariate correlations robust to non-independence, confirmed a significant but non-linear association with peak milk yield, but revealed the age effect to be potentially confounded by health status. Finally, queueing records were related back to behaviors recorded via ear tag accelerometers using linear models and mutual conditional entropy tests. Both approaches recovered consistent evidence of differences in home pen behaviors across subsections of the queue.
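For illustration, a minimal sketch of an entropy-based consistency measure is shown below, assuming a long-format table with hypothetical columns cow_id and entry_position; the study's actual estimator and pipeline may differ.

```python
import numpy as np
import pandas as pd

# Assumed columns: cow_id, milking_session, entry_position (hypothetical schema).
queue = pd.read_csv("milking_order.csv")

def shannon_entropy(positions: pd.Series) -> float:
    """Entropy of a cow's distribution over queue positions (higher = less consistent)."""
    p = positions.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

consistency = (
    queue.groupby("cow_id")["entry_position"]
         .agg(median_position="median", entropy=shannon_entropy)
         .sort_values("median_position")
)
print(consistency.head())
```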
Data Description: The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
Domain: Banking
Context: Leveraging customer information is paramount for most businesses. In the case of a bank, attributes of customers like the ones mentioned below can be crucial in strategizing a marketing campaign when launching a new product.
Learning Outcomes:
● Exploratory Data Analysis
● Preparing the data to train a model
● Training and making predictions using an Ensemble Model
● Comparing model performances
Objective: The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).
Steps and tasks:
1. Import the necessary libraries.
2. Read the data as a data frame.
3. Perform basic EDA, which should include the following, and print out your insights at every step (a minimal sketch of this step follows the list):
   a. Shape of the data
   b. Data type of each attribute
   c. Checking the presence of missing values
   d. Five-point summary of numerical attributes
   e. Checking the presence of outliers
4. Prepare the data to train a model: check that data types are appropriate, handle missing values, etc.
5. Train a few standard classification algorithms, then note and comment on their performance across different metrics.
6. Build the ensemble models and compare the results with the base models. Note: Random Forest can be used only with Decision Trees.
7. Compare the performances of all the models.
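A minimal sketch of step 3 in pandas, assuming the marketing data sits in a hypothetical bank_marketing.csv file and using an illustrative 1.5 × IQR outlier rule:

```python
import pandas as pd

bank = pd.read_csv("bank_marketing.csv")  # hypothetical filename

print(bank.shape)           # a. shape of the data
print(bank.dtypes)          # b. data type of each attribute
print(bank.isna().sum())    # c. missing values per column
print(bank.describe())      # d. five-point summary of numerical attributes

# e. flag outliers with the 1.5 * IQR rule, column by column
numeric = bank.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
print(outliers)
```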
References:
● Data analytics use cases in Banking
● Machine Learning for Financial Marketing
Analytic provenance is a data repository that can be used to study human analysis activity, thought processes, and software interaction with visual analysis tools during exploratory data analysis. It was collected during a series of user studies involving exploratory data analysis scenarios with textual and cyber security data. Interaction logs, think-alouds, videos, and all coded data from these studies are available online for research purposes. Analysis sessions are segmented into multiple sub-task steps based on the user think-alouds, videos, and audio captured during the studies. These analytic provenance datasets can be used for research involving tools and techniques for analyzing interaction logs and analysis history.
License: https://spdx.org/licenses/CC0-1.0.html
read-tv
The main paper is about read-tv, open-source software for longitudinal data visualization. We uploaded sample surgical flow disruption data as a use case to highlight read-tv's capabilities. We scrubbed the data of protected health information and uploaded it as a single CSV file. A description of the original data is provided below.
Data source
Surgical workflow disruptions, defined as “deviations from the natural progression of an operation thereby potentially compromising the efficiency or safety of care”, provide a window on the systems of work through which it is possible to analyze mismatches between the work demands and the ability of the people to deliver the work. They have been shown to be sensitive to different intraoperative technologies, surgical errors, surgical experience, room layout, checklist implementation, and the effectiveness of the supporting team. The significance of flow disruptions lies in their ability to provide a hitherto unavailable perspective on the quality and efficiency of the system. This allows for a systematic, quantitative, and replicable assessment of risks in surgical systems, evaluation of interventions to address them, and assessment of the role that technology plays in exacerbating or mitigating them.
In 2014, Drs. Catchpole and Anger were awarded NIBIB R03 EB017447 to investigate flow disruptions in robotic surgery, which has resulted in the detailed, multi-level analysis of over 4,000 flow disruptions. Direct observation of 89 RAS (robotic-assisted surgery) cases found a mean of 9.62 flow disruptions per hour, which varies across different surgical phases and is predominantly caused by coordination, communication, equipment, and training problems.
Methods
This section does not describe the methods of read-tv software development, which can be found in the associated manuscript from JAMIA Open (JAMIO-2020-0121.R1). It describes the methods involved in the surgical workflow disruption data collection. A curated, PHI-free (protected health information) version of this dataset was used as a use case for this manuscript.
Observer training
Trained human factors researchers conducted each observation following the completion of observer training. The researchers were two full-time research assistants based in the department of surgery at site 3 who visited the other two sites to collect data. Human Factors experts guided and trained each observer in the identification and standardized collection of FDs. The observers were also trained in the basic components of robotic surgery in order to be able to tangibly isolate and describe such disruptive events.
Comprehensive observer training was ensured with both classroom and floor training. Observers were required to review relevant literature, understand general practice guidelines for observing in the OR (e.g., where to stand, what to avoid, who to speak to), and conduct practice observations. The practice observations were broken down into three phases, all performed under the direct supervision of an experienced observer. During phase one, the trainees oriented themselves to the real-time events of both the OR and the general steps in RAS. The trainee was also introduced to the OR staff and any other involved key personnel. During phase two, the trainer and trainee observed three RAS procedures together to practice collecting FDs and become familiar with the data collection tool. Phase three was dedicated to determining inter-rater reliability by having the trainer and trainee simultaneously, yet independently, conduct observations for at least three full RAS procedures. Observers were considered fully trained if, after three full case observations, intra-class correlation coefficients (based on number of observed disruptions per phase) were greater than 0.80, indicating good reliability.
Data collection
Following the completion of training, observers individually conducted observations in the OR. All relevant RAS cases were pre-identified on a monthly basis by scanning the surgical schedule and recording a list of procedures. All procedures observed were conducted with the Da Vinci Xi surgical robot, with the exception of one procedure at Site 2, which was performed with the Si robot. Observers attended those cases that fit within their allotted work hours and schedule. Observers used Microsoft Surface Pro tablets configured with a customized data collection tool developed using Microsoft Excel to collect data. The data collection tool divided procedures into five phases, as opposed to the four phases previously used in similar research, to more clearly distinguish between task demands throughout the procedure. Phases consisted of phase 1 - patient in the room to insufflation, phase 2 - insufflation to surgeon on console (including docking), phase 3 - surgeon on console to surgeon off console, phase 4 - surgeon off console to patient closure, and phase 5 - patient closure to patient leaves the operating room. During each procedure, FDs were recorded into the appropriate phase, and a narrative, time-stamp, and classification (based on a robot-specific FD taxonomy) were also recorded.
Each FD was categorized into one of ten categories: communication, coordination, environment, equipment, external factors, other, patient factors, surgical task considerations, training, or unsure. The categorization system is modeled after previous studies, as well as the examples provided for each FD category.
Once in the OR, observers remained as unobtrusive as possible. They stood at an appropriate vantage point in the room without getting in the way of team members. Once an appropriate time presented itself, observers introduced themselves to the circulating nurse and informed them of the reason for their presence. Observers did not directly engage in conversations with operating room staff; however, if a staff member approached them with questions or comments, they would respond.
Data Reduction and PHI (Protected Health Information) Removal
This dataset uses 41 of the aforementioned surgeries. All columns have been removed except disruption type, a numeric timestamp for the number of minutes into the day, and surgical phase. In addition, each surgical case had its initial disruption set to 12 noon (720 minutes).
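A minimal sketch of that alignment step is shown below, assuming hypothetical column names (case_id, minutes_into_day); the published CSV may use different headers.

```python
import pandas as pd

fd = pd.read_csv("flow_disruptions.csv")  # hypothetical filename

# Shift each case so that its first disruption occurs at 12 noon (720 minutes).
first = fd.groupby("case_id")["minutes_into_day"].transform("min")
fd["minutes_into_day"] = fd["minutes_into_day"] - first + 720

# After alignment, every case's earliest disruption sits at 720 minutes.
print(fd.groupby("case_id")["minutes_into_day"].min().head())
```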
This dataset was created by Apurva Varshney
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In many phenomena, data are collected on a large scale and at different frequencies. In this context, functional data analysis (FDA) has become an important statistical methodology for analyzing and modeling such data. The approach of FDA is to assume that data are continuous functions and that each continuous function is considered as a single observation. Thus, FDA deals with large-scale and complex data. However, visualization and exploratory data analysis, which are very important in practice, can be challenging due to the complexity of the continuous functions. Here we introduce a type of record concept for functional data, and we propose some nonparametric tools based on the record concept for functional data observed over time (functional time series). We study the properties of the trajectory of the number of record curves under different scenarios. Also, we propose a unit root test based on the number of records. The trajectory of the number of records over time and the unit root test can be used for visualization and exploratory data analysis. We illustrate the advantages of our proposal through a Monte Carlo simulation study. We also illustrate our method on two different datasets: Daily wind speed curves at Yanbu, Saudi Arabia and annual mortality rates in France. Overall, we can identify the type of functional time series being studied based on the number of record curves observed. Supplementary materials for this article are available online.
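As a toy illustration, the sketch below counts record curves under one common definition (a curve is a record if it exceeds every earlier curve at all evaluation points); the paper's exact criterion and test statistic may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
curves = np.cumsum(rng.normal(size=(40, 100)), axis=1)  # 40 hypothetical daily curves

def count_records(series: np.ndarray) -> int:
    """Count record curves: curves exceeding all earlier curves pointwise."""
    running_max = series[0].copy()
    records = 1  # the first curve is a record by convention
    for curve in series[1:]:
        if np.all(curve > running_max):
            records += 1
        running_max = np.maximum(running_max, curve)
    return records

print("number of record curves:", count_records(curves))
```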
The containment of the global epidemic increase of chronic diseases is a major objective of health care systems worldwide. However, the fulfillment of this objective is complicated by the multifactorial origin of many frequent chronic diseases. Comprehensive investigations are necessary to grasp the complexity of the pathophysiological mechanisms of chronic diseases, but they frequently result in the acquisition of complex data with numerous highly correlated variables. The statistical analysis of such complex data to identify disease-associated markers is a daunting challenge. In general, the application of regression methods to complex data is accompanied by problems of multiple testing and multicollinearity. The machine learning method Random Survival Forest (RSF) represents a promising approach for the survival time analysis of complex data.
Against this background, the present thesis aimed to evaluate the applicability of RSF for survival analysis of complex data in the European Prospective Investigation into Cancer and Nutrition (EPIC)-Potsdam study. An RSF backward selection algorithm was developed for the purpose of variable selection. A simulation study was then performed to evaluate the RSF method and the RSF backward algorithm. Subsequently, the RSF backward algorithm was applied to prospective observational data of the EPIC-Potsdam study to identify metabolites associated with incident T2D and to identify food groups associated with incident hypertension.
The conducted simulation study confirmed the suitability of the RSF method and the implemented RSF backward algorithm as a tool for variable selection. It was demonstrated that the RSF method is able to identify predictive variables while taking into account possible confounders and can also handle the problem of multicollinearity. The subsequent application of the RSF backward algorithm to data of the EPIC-Potsdam study resulted in the successful identification of several metabolites and food groups which were associated with incident T2D and incident hypertension, respectively. Besides hexose, the metabolite diacyl-phosphatidylcholine (PC) C38:3, acyl-alkyl-PC C34:4, the amino acids valine, tyrosine, and glycine, and a correlation pattern of five acyl-alkyl-PCs and two diacyl-PCs were associated with the incidence of T2D. Regarding the incidence of hypertension, a lunch and dinner pattern was most informative in women. In addition, a pattern reflecting dairy fat and cheese consumption and the consumption of spirits were also associated with incident hypertension in women and men. By using partial plots, the direction of non-linear associations between identified variables and incident T2D and hypertension was visualised, which enhanced the interpretability of the findings.
In conclusion, the findings of the present thesis demonstrated that the RSF method and the implemented RSF backward algorithm represent a sensible complement to existing survival analysis methods. The RSF backward algorithm is particularly useful for exploratory analysis of complex survival data to identify unknown biomarkers associated with time until event of interest. However, the verification of the implemented RSF backward algorithm and of the present findings in external cohorts as well as the translation of the present findings for clinical diagnosis, prevention strategies and dietary recommendations should be a matter for future research.
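A schematic sketch of a backward-elimination loop around a Random Survival Forest is shown below; it uses scikit-survival and permutation importance as stand-ins, which is an assumption about the implementation rather than a reproduction of the thesis' algorithm or data.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

def rsf_backward_selection(X: pd.DataFrame, event, time, min_features: int = 3):
    """Iteratively drop the least important feature until min_features remain."""
    y = Surv.from_arrays(event=event, time=time)
    features = list(X.columns)
    while len(features) > min_features:
        rsf = RandomSurvivalForest(n_estimators=200, random_state=0)
        rsf.fit(X[features], y)
        cindex = rsf.score(X[features], y)  # concordance index on the training data
        imp = permutation_importance(rsf, X[features], y, n_repeats=5, random_state=0)
        weakest = features[int(np.argmin(imp.importances_mean))]
        print(f"c-index {cindex:.3f} with {len(features)} features; dropping {weakest}")
        features.remove(weakest)
    return features
```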
This dataset contains race data from the past ten years of the NCAA 100 freestyle (men) event. I collected this data using my own Python script, in which you follow along with a race by pressing the "Enter" button with each stroke. Upon completion of the script, CSV and PDF files are generated containing data from the race. I aggregated this data to complete my first project.
In order to aggregate, organize, and visualize the data, I used a variety of software, such as BigQuery (SQL), Python, Tableau, and Google Sheets. This project shows my ability to use a range of data analysis tools.
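The author's script is not included here, but a rough reconstruction of the stroke-logging idea might look like the following sketch (the output file name and the exit key are assumptions).

```python
import csv
import time

def log_strokes(outfile: str = "race_strokes.csv") -> None:
    """Record the elapsed time of each stroke by pressing Enter; type 'q' to finish."""
    input("Press Enter when the race starts...")
    start = time.perf_counter()
    times = []
    while True:
        key = input()  # Enter = one stroke; type 'q' then Enter to finish
        if key.strip().lower() == "q":
            break
        times.append(time.perf_counter() - start)
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["stroke", "seconds_elapsed"])
        writer.writerows(enumerate(times, start=1))

if __name__ == "__main__":
    log_strokes()
```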
Analytics has penetrated every industry owing to the various technology platforms that collect information, and thus service providers know exactly what customers want. The credit card industry is no exception. Within credit card payment processing, there is a significant amount of data available that can be beneficial in countless ways.
The data available from a credit card processor identifies consumer types and their business spending behaviors. Marketing campaigns that directly address these behaviors can therefore grow revenue and result in greater sales.
In order to effectively produce quality decisions in the modern credit card industry, knowledge must be gained through effective data analysis and modeling. Through the use of dynamic, data-driven decision-making tools and procedures, information can be gathered to successfully evaluate all aspects of credit card operations. PSPD Bank has banking operations in more than 50 countries across the globe. Mr. Jim Watson, CEO, wants to evaluate areas of bankruptcy, fraud, and collections, and to respond to customer requests for help with proactive offers and service.
This workbook has the following sheets:
Create a report and display the calculated metrics, reports and inferences.