This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
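As an illustration of how a table with this schema can be produced and inspected, here is a minimal Python sketch using Faker and pandas. The column names follow the description above; the category lists, amount ranges, and locale are assumptions made for illustration, not the generator actually used to build this dataset.

```python
import random

import pandas as pd
from faker import Faker

fake = Faker("en_IN")  # assumption: Indian locale, since discounts are quoted in INR

def make_transaction():
    gross = round(random.uniform(100, 10000), 2)  # assumed amount range
    availed = random.choice(["Yes", "No"])
    discount = round(random.uniform(10, 500), 2) if availed == "Yes" else 0.0
    return {
        "CID": fake.uuid4(),
        "TID": fake.uuid4(),
        "Gender": random.choice(["Male", "Female"]),
        "Age Group": random.choice(["18-25", "26-35", "36-45", "46-60", "60+"]),  # assumed bins
        "Purchase Date": fake.date_time_between(start_date="-1y", end_date="now"),
        "Product Category": random.choice(["Electronics", "Apparel", "Groceries", "Books"]),
        "Discount Availed": availed,
        "Discount Name": "FESTIVE50" if availed == "Yes" else None,
        "Discount Amount (INR)": discount,
        "Gross Amount": gross,
        "Net Amount": round(gross - discount, 2),
        "Purchase Method": random.choice(["Credit Card", "Debit Card", "UPI", "Cash"]),
        "Location": fake.city(),
    }

# Build a small synthetic table and take a first EDA look at it.
df = pd.DataFrame([make_transaction() for _ in range(1000)])
print(df.describe(include="all"))
```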
This is not a real dataset: it was generated using Python's Faker library for the sole purpose of learning.
The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across various industries. The market, estimated at $1.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033. This expansion is fueled by several key factors. Firstly, the rising adoption of big data analytics and business intelligence initiatives across large enterprises and SMEs is creating a significant demand for efficient EDA tools. Secondly, the growing need for faster, more insightful data analysis to support better decision-making is driving the preference for user-friendly graphical EDA tools over traditional non-graphical methods. Furthermore, advancements in artificial intelligence and machine learning are seamlessly integrating into EDA tools, enhancing their capabilities and broadening their appeal. The market segmentation reveals a significant portion held by large enterprises, reflecting their greater resources and data handling needs. However, the SME segment is rapidly gaining traction, driven by the increasing affordability and accessibility of cloud-based EDA solutions. Geographically, North America currently dominates the market, but regions like Asia-Pacific are exhibiting high growth potential due to increasing digitalization and technological advancements. Despite this positive outlook, certain restraints remain. The high initial investment cost associated with implementing advanced EDA solutions can be a barrier for some SMEs. Additionally, the need for skilled professionals to effectively utilize these tools can create a challenge for organizations. However, the ongoing development of user-friendly interfaces and the availability of training resources are actively mitigating these limitations. The competitive landscape is characterized by a mix of established players like IBM and emerging innovative companies offering specialized solutions. Continuous innovation in areas like automated data preparation and advanced visualization techniques will further shape the future of the EDA tools market, ensuring its sustained growth trajectory.
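The projection quoted above can be sanity-checked with the standard compound-growth formula; the short sketch below only re-derives the stated figures and introduces no new data.

```python
base = 1.5          # market size in 2025, USD billions (figure quoted above)
cagr = 0.15         # compound annual growth rate quoted above
years = 2033 - 2025

projected = base * (1 + cagr) ** years
print(f"Projected 2033 market size: ${projected:.2f}B")  # about $4.59B, i.e. roughly $5B
```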
The global Exploratory Data Analysis (EDA) Tools market is anticipated to experience significant growth in the coming years, driven by the increasing adoption of data-driven decision-making and the growing need for efficient data exploration and analysis. The market size is valued at USD XX million in 2025 and is projected to reach USD XX million by 2033, registering a CAGR of XX% during the forecast period. The increasing complexity and volume of data generated by businesses and organizations have necessitated the use of advanced data analysis tools to derive meaningful insights and make informed decisions. Key trends driving the market include the rising adoption of AI and machine learning technologies, the growing need for self-service data analytics, and the increasing emphasis on data visualization and storytelling. Non-graphical EDA tools are gaining traction due to their ability to handle large and complex datasets. Graphical EDA tools are preferred for their intuitive and interactive user interfaces that simplify data exploration. Large enterprises are major consumers of EDA tools as they have large volumes of data to analyze. SMEs are also increasingly adopting EDA tools as they realize the importance of data-driven insights for business growth. The North American region holds a significant market share due to the presence of established technology companies and a high adoption rate of data analytics solutions. The Asia Pacific region is expected to witness substantial growth due to the rising number of businesses and organizations in emerging economies.
This dataset was created by Mohinur Abdurahimova
Released under Data files © Original Authors
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
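The comparison of PCA and UMAP followed by clustering described here can be sketched generically as below; this runs on placeholder random data with scikit-learn and umap-learn defaults, and is not the authors' pipeline or their IRMS spectra.

```python
import numpy as np
import umap  # provided by the umap-learn package
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for a high-dimensional mass-spectrometry feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X_std = StandardScaler().fit_transform(X)

# Linear embedding: PCA to 2 components.
X_pca = PCA(n_components=2).fit_transform(X_std)

# Nonlinear embedding: UMAP to 2 components.
X_umap = umap.UMAP(n_components=2, random_state=0).fit_transform(X_std)

# Cluster in each embedding and compare the resulting data-driven groups.
labels_pca = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)
labels_umap = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_umap)
print(np.bincount(labels_pca), np.bincount(labels_umap))
```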
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thorough knowledge of the structure of the analyzed data allows one to form detailed scientific hypotheses and research questions. The structure of data can be revealed with methods for exploratory data analysis. Due to the multitude of available methods, selecting those which will work together well and facilitate data interpretation is not an easy task. In this work we present a well-fitted set of tools for a complete exploratory analysis of a clinical dataset and perform a case study analysis on a set of 515 patients. The proposed procedure comprises several steps: 1) robust data normalization, 2) outlier detection with Mahalanobis (MD) and robust Mahalanobis distances (rMD), 3) hierarchical clustering with Ward’s algorithm, 4) Principal Component Analysis with biplot vectors. The analyzed set comprised elderly patients who participated in the PolSenior project. Each patient was characterized by over 40 biochemical and socio-geographical attributes. Introductory analysis showed that the case-study dataset comprises two clusters separated along the axis of sex hormone attributes. Further analysis was carried out separately for male and female patients. The optimal partitioning in the male set resulted in five subgroups. Two of them were related to diseased patients: 1) diabetes and 2) hypogonadism patients. Analysis of the female set suggested that it was more homogeneous than the male dataset. No evidence of pathological patient subgroups was found. In the study we showed that outlier detection with MD and rMD allows not only to identify outliers, but also to assess the heterogeneity of a dataset. The case study proved that our procedure is well suited for identification and visualization of biologically meaningful patient subgroups.
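Steps 1-4 of the proposed procedure can be approximated in Python as in the sketch below (robust scaling, MD and rMD via scikit-learn covariance estimators, Ward clustering, PCA). It runs on placeholder data, and the 97.5th-percentile outlier cut-off is an assumption for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.covariance import EmpiricalCovariance, MinCovDet
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(515, 40))  # placeholder: 515 patients x ~40 attributes

# 1) robust normalization
X = RobustScaler().fit_transform(X)

# 2) classical (MD) and robust (rMD) Mahalanobis distances for outlier detection
md = np.sqrt(EmpiricalCovariance().fit(X).mahalanobis(X))
rmd = np.sqrt(MinCovDet(random_state=0).fit(X).mahalanobis(X))
outliers = rmd > np.percentile(rmd, 97.5)  # assumed cut-off

# 3) hierarchical clustering with Ward's algorithm on the non-outlying patients
Z = linkage(X[~outliers], method="ward")
clusters = fcluster(Z, t=5, criterion="maxclust")

# 4) PCA scores for a biplot-style 2D view of the subgroups
scores = PCA(n_components=2).fit_transform(X[~outliers])
print(np.bincount(clusters), scores.shape)
```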
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support arguments that informed the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes, which you can find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing applied consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.
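A generic sketch of this transformation step is shown below: one-hot encoding the nominal features, projecting to two principal components, and inspecting how within-cluster variance drops as the number of clusters grows. The encoding choice and the use of scikit-learn are assumptions for illustration; the original analysis was carried out in RapidMiner.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Placeholder for the 35 nominal features extracted from the DL4SE papers.
papers = pd.DataFrame(rng.integers(0, 4, size=(128, 35)).astype(str),
                      columns=[f"feature_{i}" for i in range(35)])

# One-hot encode the nominal attributes, then project to 2 principal components.
X = pd.get_dummies(papers).to_numpy(dtype=float)
X2 = PCA(n_components=2).fit_transform(X)

# Read off an "elbow": how within-cluster variance (inertia) drops as k grows.
for k in range(2, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X2).inertia_
    print(k, round(inertia, 1))
```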
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships among the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state of the art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement divided by the number of occurrences of the premise.
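A small worked example of these two quantities, using made-up counts rather than numbers from the study:

```python
# Illustrative counts only -- not taken from the DL4SE data.
n_statements = 128   # total number of extracted statements (papers/records)
n_premise = 40       # occurrences of the premise, e.g. "uses Supervised Learning"
n_both = 25          # occurrences where premise and conclusion ("irreproducible") both hold

support = n_both / n_statements   # how often the full rule is observed
confidence = n_both / n_premise   # how often the conclusion follows the premise
print(f"support = {support:.3f}, confidence = {confidence:.3f}")
```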
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was created by Mustafa Ghzi
Released under CC BY-NC-SA 4.0
Airbnb® is an American company operating an online marketplace for lodging, primarily for vacation rentals. The purpose of this study is to perform an exploratory data analysis of two datasets containing Airbnb® listings across 10 major cities. We aim to use various data visualizations to gain valuable insight into the effects of pricing, COVID-19, and more.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Although exploratory data analysis (EDA) is a powerful approach for uncovering insights from unfamiliar datasets, existing EDA tools face challenges in assisting users to assess the progress of exploration and synthesize coherent insights from isolated findings. To address these challenges, we present FactExplorer, a novel fact-based EDA system that shifts the analysis focus from raw data to data facts. FactExplorer employs a hybrid logical-visual representation, providing users with a comprehensive overview of all potential facts at the outset of their exploration. Moreover, FactExplorer introduces fact-mining techniques, including topic-based drill-down and transition path search capabilities. These features facilitate in-depth analysis of facts and enhance the understanding of interconnections between specific facts. Finally, we present a usage scenario and conduct a user study to assess the effectiveness of FactExplorer. The results indicate that FactExplorer facilitates the understanding of isolated findings and enables users to steer a thorough and effective EDA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comprehensive exploratory data analysis
This short activity can be used to introduce the concept of exploratory data analysis and get participants to think about how this data science strategy is complementary to having students interpret graphs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Parallel Coordinate Plots (PCP) are a valuable tool for exploratory data analysis of high-dimensional numerical data. The use of PCPs is limited when working with categorical variables or a mix of categorical and continuous variables. In this article, we propose Generalized Parallel Coordinate Plots (GPCP) to extend PCPs from numeric variables only to handling a mix of categorical and numeric variables seamlessly in a single plot. In this process we find that existing solutions for categorical values only, such as hammock plots or parsets, become edge cases in the new framework. By focusing on individual observations rather than a marginal frequency we gain additional flexibility. The resulting approach is implemented in the R package ggpcp. Supplementary materials for this article are available online.
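GPCPs themselves are provided by the R package ggpcp mentioned above. As a rough Python analogue of the numeric-only case that GPCPs generalize, pandas ships a classical parallel-coordinates plot; the sketch below shows only that baseline on the Iris data, not the categorical extension.

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

# Classical PCP over numeric variables only (the case GPCPs generalize).
iris = load_iris(as_frame=True).frame
iris["species"] = iris["target"].map(dict(enumerate(load_iris().target_names)))

parallel_coordinates(iris.drop(columns="target"), class_column="species", alpha=0.4)
plt.tight_layout()
plt.show()
```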
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
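The direct (non-factor-based) summaries listed above can be computed per spectrum in a few lines, as in the sketch below on placeholder data; PRE, X4, and DS-PRE are defined in the paper itself and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
spectra = rng.random((100, 1024))  # placeholder: 100 spectra x 1024 channels

# Each statistic collapses every spectrum to one number; plotting or clustering
# these values is the direct EDA the paper compares against PCA/MCR.
summaries = {
    "mean":   spectra.mean(axis=1),
    "std":    spectra.std(axis=1),
    "1-norm": np.abs(spectra).sum(axis=1),
    "range":  spectra.max(axis=1) - spectra.min(axis=1),
    "ssq":    (spectra ** 2).sum(axis=1),
}
for name, values in summaries.items():
    print(f"{name:>6}: first five values {np.round(values[:5], 3)}")
```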
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
In Chapter 8 of the thesis, six sonification models are presented to give some examples for the framework of Model-Based Sonification developed in Chapter 7. Sonification models determine the rendering of the sonification and the possible interactions. The "model in mind" helps the user to interpret the sound with respect to the data.
Data Sonograms use spherical expanding shock waves to excite linear oscillators which are represented by point masses in model space.
File: Iris dataset: started in plot (a) (https://pub.uni-bielefeld.de/download/2920448/2920454) at S0, (b) at S1, (c) at S2; 10d noisy circle dataset: started in plot (c) at S0 (mean) (https://pub.uni-bielefeld.de/download/2920448/2920451), (d) at S1 (edge); 10d Gaussian: plot (d) started at S0; 3 clusters: Example 1; 3 clusters, invisible columns used as output variables: Example 2 (https://pub.uni-bielefeld.de/download/2920448/2920450)
Description: Data Sonogram Sound examples for synthetic datasets and the Iris dataset
Duration: about 5 s
This sonification model explores features of a data distribution by computing the trajectories of test particles which are injected into model space and move according to Newton's laws of motion in a potential given by the dataset.
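A toy numeric sketch of this idea is shown below: a test particle is integrated under a potential built from Gaussian wells centred on placeholder data points. The kernel form, width, and integrator are assumptions for illustration; the mapping from the trajectory to sound is omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(size=(200, 2))  # placeholder dataset defining the potential
sigma = 0.5                       # assumed kernel width

def force(x):
    # Gradient of a sum of Gaussian wells centred on the data points.
    diff = data - x
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2))
    return (w[:, None] * diff).sum(axis=0) / sigma ** 2

# Inject a test particle and integrate Newton's equations (semi-implicit Euler).
pos, vel, dt = np.array([3.0, 0.0]), np.zeros(2), 0.01
trajectory = []
for _ in range(2000):
    vel += dt * force(pos)
    pos += dt * vel
    trajectory.append(pos.copy())

trajectory = np.asarray(trajectory)
print(trajectory[-1])  # in a sonification, this trajectory would drive the sound synthesis
```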
The McMC Sonification Model defines an exploratory process in the domain of a given density p such that the acoustic representation summarizes features of p, particularly the modes of p, by sound.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identifying and dealing with outliers is an important part of data analysis. A new visualization, the O3 plot, is introduced to aid in the display and understanding of patterns of multivariate outliers. It uses the results of identifying outliers for every possible combination of dataset variables to provide insight into why particular cases are outliers. The O3 plot can be used to compare the results from up to six different outlier identification methods. There is an R package, OutliersO3, implementing the plot. The article is illustrated with outlier analyses of German demographic and economic data. Supplementary materials for this article are available online.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Extracting useful information from such massive datasets is an ongoing challenge as traditional data visualization tools typically do not scale well in high-dimensional settings. An existing visualization technique that is particularly well suited to visualizing large datasets is the heatmap. Although heatmaps are extremely popular in fields such as bioinformatics, they remain a severely underutilized visualization tool in modern data analysis. This article introduces superheat, a new R package that provides an extremely flexible and customizable platform for visualizing complex datasets. Superheat produces attractive and extendable heatmaps to which the user can add a response variable as a scatterplot, model results as boxplots, correlation information as barplots, and more. The goal of this article is two-fold: (1) to demonstrate the potential of the heatmap as a core visualization method for a range of data types, and (2) to highlight the customizability and ease of implementation of the superheat R package for creating beautiful and extendable heatmaps. The capabilities and fundamental applicability of the superheat package will be explored via three reproducible case studies, each based on publicly available data sources.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This data is publicly available on GitHub here. It can be utilized for EDA, Statistical Analysis, and Visualizations.
The data set ifood_df.csv consists of 2206 customers of XYZ company with data on:
- Customer profiles
- Product preferences
- Campaign successes/failures
- Channel performance
I do not own this dataset. I am simply making it accessible on this platform via the public GitHub link.
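A minimal sketch of loading the file and starting the EDA suggested above; only the filename comes from the description, and the local path is an assumption.

```python
import pandas as pd

# Load the file named in the description; the path is assumed to be local.
df = pd.read_csv("ifood_df.csv")

print(df.shape)                                            # expected: (2206, n_columns)
print(df.dtypes)                                           # column types for preprocessing
print(df.isna().sum().sort_values(ascending=False).head()) # missing values to handle
print(df.describe().T.head(10))                            # quick numeric summary
```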
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
I am practicing my data analysis skills here on the Zomato dataset: basic exploratory data analysis and feature engineering.
The What to do in Paris site is a participative agenda: Parisian places such as the City Libraries and Museums, parks and gardens, entertainment centers, swimming pools, theaters, major venues such as the Gaîté Lyrique, the CENTQUATRE, and the Carreau du Temple, concert halls, associations, and even Parisians themselves are invited to insert their events into the site.