This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: The age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts from the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
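For learners who want to build a similar table from scratch, a minimal sketch using Faker and pandas is shown below. The column choices mirror the schema above, but this is an illustration, not the script that generated this dataset.

```python
# Hedged sketch of generating a similar synthetic transactions table with
# Faker and pandas; not the script used to build this dataset.
import random
import pandas as pd
from faker import Faker

fake = Faker("en_IN")                      # Indian locale to match INR amounts
rows = []
for _ in range(1000):                      # the real dataset has 55,000 rows
    gross = round(random.uniform(100, 10000), 2)
    discount = round(random.uniform(0, 0.5) * gross, 2) if random.random() < 0.4 else 0.0
    rows.append({
        "CID": fake.uuid4(),
        "TID": fake.uuid4(),
        "Gender": random.choice(["Male", "Female"]),
        "Age Group": random.choice(["18-25", "26-35", "36-45", "46-60", "60+"]),
        "Purchase Date": fake.date_time_this_year(),
        "Product Category": random.choice(["Electronics", "Apparel", "Groceries"]),
        "Discount Availed": "Yes" if discount else "No",
        "Discount Name": "FESTIVE50" if discount else None,
        "Discount Amount (INR)": discount,
        "Gross Amount": gross,
        "Net Amount": round(gross - discount, 2),
        "Purchase Method": random.choice(["Credit Card", "Debit Card", "UPI"]),
        "Location": fake.city(),
    })
df = pd.DataFrame(rows)
print(df.head())
```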
This dataset was created by Mohinur Abdurahimova.
Released under Data files © Original Authors
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA, and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and was standardized across features. The small number of samples prevented a full and robust statistical analysis of the results; nevertheless, it allowed the identification of relevant hidden patterns and trends.
Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) to the instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when the majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
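As a rough illustration, the sketch below maps the reported settings onto SciPy and scikit-learn. The authors used Orange Data Mining; scikit-learn offers no gain-ratio criterion (entropy is substituted) and no majority-stop option, and the data here are random placeholders.

```python
# Conceptual sketch of the reported pipeline using SciPy/scikit-learn,
# not the authors' Orange Data Mining workflow.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((36, 11))          # stand-in for the standardized features
y = np.repeat(np.arange(9), 4)             # 9 classes, 4 samples each

# Hierarchical clustering: Euclidean distance, weighted linkage (as reported)
Z = linkage(X, method="weighted", metric="euclidean")
clusters = fcluster(Z, t=9, criterion="maxclust")

# Decision tree: "entropy" is the closest available option to gain ratio
tree = DecisionTreeClassifier(criterion="entropy",
                              min_samples_leaf=2,
                              min_samples_split=5,
                              random_state=0)
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
print(cross_val_score(tree, X, y, cv=cv, scoring="accuracy").mean())
```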
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
In Chapter 8 of the thesis, six sonification models are presented as examples of the framework of Model-Based Sonification developed in Chapter 7. Sonification models determine the rendering of the sonification and the possible interactions. The "model in mind" helps the user to interpret the sound with respect to the data.
Data Sonograms use spherical expanding shock waves to excite linear oscillators which are represented by point masses in model space.
File:
Iris dataset: started in plot (a) at S0, (b) at S1, (c) at S2 (https://pub.uni-bielefeld.de/download/2920448/2920454)
10d noisy circle dataset: started in plot (c) at S0 (mean), (d) at S1 (edge) (https://pub.uni-bielefeld.de/download/2920448/2920451)
10d Gaussian: plot (d), started at S0
3 clusters: Example 1
3 clusters, invisible columns used as output variables: Example 2 (https://pub.uni-bielefeld.de/download/2920448/2920450)
Description:
Data Sonogram sound examples for synthetic datasets and the Iris dataset
Duration:
about 5 s
This sonification model explores features of a data distribution by computing the trajectories of test particles which are injected into model space and move according to Newton's laws of motion in a potential given by the dataset.
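A minimal numerical sketch of this particle trajectory idea follows. It assumes a potential built from Gaussian wells centered on the data points and a leapfrog integrator, and records kinetic energy as the signal that could drive sound synthesis; it illustrates the concept, not the thesis implementation.

```python
# Sketch (not the thesis code): a test particle moving by Newton's laws in a
# data-defined potential V(x) = -sum_i exp(-||x - d_i||^2 / (2 sigma^2)).
import numpy as np

def forces(pos, data, sigma=0.5):
    # F = -grad V: each data point pulls the particle toward itself
    diff = data - pos                      # (n_points, dim)
    w = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))
    return (w[:, None] * diff).sum(axis=0) / sigma**2

rng = np.random.default_rng(1)
data = rng.standard_normal((100, 2))       # stand-in dataset in 2-d model space
pos, vel, dt = np.zeros(2), rng.standard_normal(2), 0.01

energies = []
for _ in range(2000):                      # leapfrog integration
    vel += 0.5 * dt * forces(pos, data)
    pos += dt * vel
    vel += 0.5 * dt * forces(pos, data)
    energies.append(0.5 * float(vel @ vel))  # kinetic energy -> sound parameter
```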
The McMC Sonification Model defines an exploratory process in the domain of a given density p such that the acoustic representation summarizes features of p by sound, particularly the modes of p.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Parallel Coordinate Plots (PCP) are a valuable tool for exploratory data analysis of high-dimensional numerical data. The use of PCPs is limited when working with categorical variables or a mix of categorical and continuous variables. In this article, we propose Generalized Parallel Coordinate Plots (GPCP) to extend the ability of PCPs from just numeric variables to dealing seamlessly with a mix of categorical and numeric variables in a single plot. In this process we find that existing solutions for categorical values only, such as hammock plots or parsets, become edge cases in the new framework. By focusing on individual observations rather than a marginal frequency we gain additional flexibility. The resulting approach is implemented in the R package ggpcp. Supplementary materials for this article are available online.
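A rough Python analogue of the core idea is sketched below; the paper's implementation is the R package ggpcp, and this toy version only illustrates how categorical levels can be mapped to axis positions so each observation remains an individual polyline.

```python
# Toy analogue of the GPCP idea (not the ggpcp package): categorical axes
# get per-level positions so every observation stays an individual line.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({                       # assumed toy mixed-type data
    "class": ["a", "a", "b", "b", "c"],
    "x1": [1.0, 2.0, 1.5, 3.0, 2.5],
    "x2": [0.2, 0.1, 0.9, 0.8, 0.5],
})

cols = ["class", "x1", "x2"]
coords = pd.DataFrame(index=df.index)
for c in cols:
    if df[c].dtype == object:             # categorical: spread levels on [0, 1]
        levels = {v: i for i, v in enumerate(sorted(df[c].unique()))}
        coords[c] = df[c].map(levels) / max(len(levels) - 1, 1)
    else:                                 # numeric: min-max scale to [0, 1]
        lo, hi = df[c].min(), df[c].max()
        coords[c] = (df[c] - lo) / (hi - lo)

for _, row in coords.iterrows():          # one polyline per observation
    plt.plot(range(len(cols)), row[cols], alpha=0.6)
plt.xticks(range(len(cols)), cols)
plt.show()
```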
https://creativecommons.org/publicdomain/zero/1.0/
Summary data from a Calculus II class where students were required to watch an instructional video before or after the lecture. The dataset includes:
- gender (1 = female; 2 = male)
- vgroup (-1 = before lecture; 1 = after lecture)
- binary flags for 26 individual videos (1 = watched 80% or more of the video's length; 0 = not watched)
- videosum (number of videos watched)
- final_raw (raw grade received on the cumulative final course exam)
- sat_math (scaled SAT-Math score out of 800)
- math_place (institutional calculus readiness score out of 100)
- watched20 (grouping flag for students who watched 20 or more videos)
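A hedged sketch of loading and summarizing such a table with pandas follows; the file name and exact column spellings are assumptions for illustration.

```python
# Sketch: summarizing the class dataset with pandas. The file name
# "calculus_videos.csv" is hypothetical; column names follow the
# description above but may differ in the released file.
import pandas as pd

df = pd.read_csv("calculus_videos.csv")

# Recode the documented integer codings into readable labels
df["gender"] = df["gender"].map({1: "female", 2: "male"})
df["vgroup"] = df["vgroup"].map({-1: "before lecture", 1: "after lecture"})

# Compare final exam scores for heavy vs. light video watchers
print(df.groupby("watched20")["final_raw"].describe())

# Correlation of videos watched with exam and placement scores
print(df[["videosum", "final_raw", "sat_math", "math_place"]].corr())
```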
Mathematics Education
DeFranco, Thomas; Judd, Jamison
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Identifying and dealing with outliers is an important part of data analysis. A new visualization, the O3 plot, is introduced to aid in the display and understanding of patterns of multivariate outliers. It uses the results of identifying outliers for every possible combination of dataset variables to provide insight into why particular cases are outliers. The O3 plot can be used to compare the results from up to six different outlier identification methods. There is an R package, OutliersO3, implementing the plot. The article is illustrated with outlier analyses of German demographic and economic data. Supplementary materials for this article are available online.
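The sketch below illustrates the underlying idea, flagging outliers for every combination of variables, with a single crude Mahalanobis rule as a stand-in; the actual O3 plot and the six identification methods it can compare live in the R package OutliersO3.

```python
# Conceptual sketch of the O3 idea in Python, not the OutliersO3 package:
# flag outliers for every variable combination and report which cases hit.
from itertools import combinations
import numpy as np
from scipy.stats import chi2

def flag_outliers(X, alpha=0.05):
    # Mahalanobis distance against a chi-square cutoff (a simple stand-in
    # for the identification methods the O3 plot compares)
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False).reshape(X.shape[1], -1))
    d2 = np.einsum("ij,jk,ik->i", X - mu, cov_inv, X - mu)
    return d2 > chi2.ppf(1 - alpha, df=X.shape[1])

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
X[0] += 6                                  # plant an obvious outlier

for k in range(1, X.shape[1] + 1):
    for vars_ in combinations(range(X.shape[1]), k):
        hits = np.nonzero(flag_outliers(X[:, vars_]))[0]
        if hits.size:
            print(f"variables {vars_}: outlier cases {hits.tolist()}")
```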
Data description: This dataset consists of spectroscopic data files and associated R-scripts for exploratory data analysis. Attenuated total reflectance Fourier transform infrared (ATR-FTIR) spectra were collected from 67 samples of polymer filaments potentially used to produce illicit 3D-printed items. Principal component analysis (PCA) was used to determine if any individual filaments gave distinctive spectral signatures, potentially allowing traceability of 3D-printed items for forensic purposes. The project also investigated potential chemical variations induced by the filament manufacturing or 3D-printing process. Data was collected and analysed by Michael Adamos at Curtin University (Perth, Western Australia), under the supervision of Dr Georgina Sauzier and Prof. Simon Lewis and with specialist input from Dr Kari Pitts.
Data collection time details: 2024
Number of files/types: 3 .R files, 702 .JDX files
Geographic information: Australia
Keywords: 3D printing, polymers, infrared spectroscopy, forensic science
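A minimal sketch of the exploratory PCA described above is given below, using scikit-learn on an already-assembled spectra matrix; parsing the .JDX files into that matrix is assumed to happen elsewhere, and the data here are random placeholders.

```python
# Sketch of exploratory PCA on ATR-FTIR spectra; the 67 x n_wavenumbers
# matrix is a random stand-in for the parsed .JDX files.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
spectra = rng.random((67, 1800))          # stand-in: 67 samples x wavenumbers

X = StandardScaler().fit_transform(spectra)
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# In a scores plot, clustering by filament brand/type would suggest
# distinctive spectral signatures useful for forensic traceability
print(pca.explained_variance_ratio_)
print(scores[:5])
```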
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the "critical pair," which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
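As a hedged illustration, the sketch below computes several of the summary statistics named above for a matrix of spectra; PRE and DS-PRE are the paper's own methods and are not reproduced here.

```python
# Sketch: condensing each spectrum/row to a handful of scalar descriptors,
# the kind of per-sample summary statistics the paper compares in EDA.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
data = rng.random((30, 500))              # stand-in: 30 spectra

stats = pd.DataFrame({
    "mean":   data.mean(axis=1),
    "std":    data.std(axis=1),
    "1-norm": np.abs(data).sum(axis=1),
    "range":  data.max(axis=1) - data.min(axis=1),
    "ssq":    (data**2).sum(axis=1),
})
# These one-number-per-sample summaries can be plotted, compared, or fed to
# cluster analysis as a fast first pass before factor-based methods like PCA
print(stats.describe())
```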
This section does not describe the methods of read-tv software development, which can be found in the associated manuscript from JAMIA Open (JAMIO-2020-0121.R1). This section describes the methods involved in the surgical workflow disruption data collection. A curated version of this dataset, free of protected health information (PHI), was used as a use case for this manuscript.
Observer training
Trained human factors researchers conducted each observation following the completion of observer training. The researchers were two full-time research assistants based in the department of surgery at site 3, who visited the other two sites to collect data. Human factors experts guided and trained each observer in the identification and standardized collection of flow disruptions (FDs). The observers were also trained in the basic components of robotic surgery so that they could tangibly isolate and describe such disruptive events.
Comprehensive observer training was ensured with both classroom and floor train...
The high resolution and mass accuracy of Fourier transform mass spectrometry (FT-MS) have made it an increasingly popular technique for discerning the composition of soil, plant and aquatic samples containing complex mixtures of proteins, carbohydrates, lipids, lignins, hydrocarbons, phytochemicals and other compounds. Thus, there is a growing demand for informatics tools to analyze FT-MS data that will aid investigators seeking to understand the availability of carbon compounds to biotic and abiotic oxidation and to compare fundamental chemical properties of complex samples across groups. We present ftmsRanalysis, an R package which provides an extensive collection of data formatting and processing, filtering, visualization, and sample and group comparison functionalities. The package provides a suite of plotting methods and enables expedient, flexible and interactive visualization of complex datasets through functions which link to a powerful and interactive visualization user interface, Trelliscope. Example analysis using FT-MS data from a soil microbiology study demonstrates the core functionality of the package and highlights the capabilities for producing interactive visualizations.
At least 350k posts are published on X, 510k comments are posted on Facebook, and 66k pictures and videos are shared on Instagram each minute. These large datasets require substantial processing power, even if only a percentage is collected for analysis and research. To face this challenge, data scientists can now use computer clusters deployed on various IaaS and PaaS services in the cloud. However, scientists still have to master the design of distributed algorithms and be familiar with using distributed computing programming frameworks. It is thus essential to generate tools that provide analysis methods to leverage the advantages of computer clusters for processing large amounts of social network text. This paper presents Whistlerlib, a new Python library for conducting exploratory analysis on large text datasets on social networks. Whistlerlib implements distributed versions of various social media, sentiment, and social network analysis methods that can run atop computer clusters. We experimentally demonstrate the scalability of the various Whistlerlib distributed methods when deployed on a public cloud platform. We also present a practical example of the analysis of posts on the social network X about the Mexico City subway to showcase the features of Whistlerlib in scenarios where social network analysis tools are needed to address issues with a social dimension.
https://creativecommons.org/publicdomain/zero/1.0/
This is a sample dataset derived from 2010 U.S. Government Census data. It is intended to be used in combination with example analyses on the public dataset "Iowa Liquor Sales", available as a Google Public Dataset, on Kaggle, and at https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy.
This dataset is intended for use as an example. Columns have purposely not been cleaned up via string manipulation, so that users can practice joining data between two pandas DataFrames and doing further processing.
Because this data is at the Zip Code Tabulation Area (ZCTA) level, additional processing is required to join it with general-purpose datasets, which may be specified at the zip code, county name, county FIPS code, or coordinate level. This is intentional.
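A minimal sketch of the intended join is shown below; the file and column names are assumptions for illustration rather than the datasets' documented schemas.

```python
# Sketch: attaching census population to liquor sales via zip code.
# File names and columns ("zcta", "zip_code", "sale_dollars", "population")
# are hypothetical placeholders.
import pandas as pd

census = pd.read_csv("iowa_census_2010.csv", dtype={"zcta": str})
sales = pd.read_csv("iowa_liquor_sales.csv", dtype={"zip_code": str})

# ZCTAs are not exactly zip codes, but treating them as equal is a common
# first approximation; string columns may need trimming before merging
census["zcta"] = census["zcta"].str.strip().str.zfill(5)
sales["zip_code"] = sales["zip_code"].str.strip().str.zfill(5)

merged = sales.merge(census, left_on="zip_code", right_on="zcta", how="left")
print(merged[["zip_code", "sale_dollars", "population"]].head())
```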
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Interval data are widely used in many fields, notably in economics, industry, and health areas. Analogous to the scatterplot for single-value data, the rectangle plot and cross plot are the conventional visualization methods for the relationship between two variables in interval forms. These methods do not provide much information to assess complicated relationships, however. In this article, we propose two visualization methods: Segment and Dandelion plots. They offer much more information than the existing visualization methods and allow us to have a much better understanding of the relationship between two variables in interval forms. A general guide for reading these plots is provided. Relevant theoretical support is developed. Both empirical and real data examples are provided to demonstrate the advantages of the proposed visualization methods. Supplementary materials for this article are available online.
Coffee is one of the most popular beverages in the world; however, little information is available regarding the mineral composition of commercial roasted and ground (RG) coffees and its correlation with organic bioactive compounds. Twenty-one commercial Brazilian RG coffee brands (9 traditional (T) and 12 extra strong (ES) roasted ones) were analyzed for the minerals Cu, Ca, Mn, Mg, K, Zn, and Fe, as well as caffeine, 5-caffeoylquinic acid (5-CQA) and melanoidins contents. For mineral determination by flame atomic absorption spectrometry (FAAS), the samples were decomposed by microwave-assisted wet digestion. Caffeine and 5-CQA were determined by liquid chromatography, and melanoidins by molecular absorption spectrometry. The association between mineral and organic compound contents in RG coffee was examined by principal component analysis. The thermostable compounds (minerals and caffeine) were related to dimensions 1 and 2, while 5-CQA and melanoidins were related to dimension 3, allowing the segmentation of T coffees from ES ones.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Sani Kamal
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The development of high-throughput biomolecular technologies has resulted in generation of vast omics data at an unprecedented rate. This is transforming biomedical research into a big data discipline, where the main challenges relate to the analysis and interpretation of data into new biological knowledge. The aim of this study was to develop a framework for biomedical big data analytics, and apply it for analyzing transcriptomics time series data from early differentiation of human pluripotent stem cells towards the mesoderm and cardiac lineages. To this end, transcriptome profiling by microarray was performed on differentiating human pluripotent stem cells sampled at eleven consecutive days. The gene expression data was analyzed using the five-stage analysis framework proposed in this study, including data preparation, exploratory data analysis, confirmatory analysis, biological knowledge discovery, and visualization of the results. Clustering analysis revealed several distinct expression profiles during differentiation. Genes with an early transient response were strongly related to embryonic- and mesendoderm development, for example CER1 and NODAL. Pluripotency genes, such as NANOG and SOX2, exhibited substantial downregulation shortly after onset of differentiation. Rapid induction of genes related to metal ion response, cardiac tissue development, and muscle contraction were observed around day five and six. Several transcription factors were identified as potential regulators of these processes, e.g. POU1F1, TCF4 and TBP for muscle contraction genes. Pathway analysis revealed temporal activity of several signaling pathways, for example the inhibition of WNT signaling on day 2 and its reactivation on day 4. This study provides a comprehensive characterization of biological events and key regulators of the early differentiation of human pluripotent stem cells towards the mesoderm and cardiac lineages. The proposed analysis framework can be used to structure data analysis in future research, both in stem cell differentiation, and more generally, in biomedical big data analytics.
MIT License: https://opensource.org/licenses/MIT
This dataset contains clinical and diagnostic features related to Breast Cancer, designed for comprehensive Exploratory Data Analysis (EDA) and subsequent predictive modeling.
It is derived from digitized images of Fine Needle Aspirates (FNA) of breast masses.
The dataset features quantitative measurements, typically calculated from the characteristics of cell nuclei, including:
- Radius
- Texture
- Perimeter
- Area
- Smoothness
- Compactness
- Concavity
- Concave Points
- Symmetry
- Fractal Dimension
These features are provided as mean, standard error, and "worst" (largest) values.
The primary goal of this resource is to support the validation of EDA techniques necessary for clinical data science:
- Data quality assessment (missing values, inconsistencies).
- Feature assessment (distributions, correlations).
- Visualization for diagnostic modeling.
The primary target variable is the binary classification of the tissue sample: Malignant vs. Benign.
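Assuming this mirrors the classic Wisconsin Diagnostic Breast Cancer data that ships with scikit-learn (the Kaggle file may differ in column naming), a quick EDA pass might look like the following sketch.

```python
# Quick EDA sketch on the scikit-learn copy of the Wisconsin Diagnostic
# Breast Cancer data; the dataset described above may differ slightly.
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame                            # 30 features + 'target' column

print(df.isna().sum().sum())               # data quality: missing values
print(df.groupby("target")["mean radius"].describe())  # malignant vs benign

# Feature assessment: correlations with the diagnosis
corr = df.corr(numeric_only=True)["target"].sort_values()
print(corr.head(10))
```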
https://creativecommons.org/publicdomain/zero/1.0/
A dataset I generated to showcase a sample set of user data for a fictional streaming service. This data is great for practicing SQL, Excel, Tableau, or Power BI.
1000 rows and 25 columns of connected data.
See below for column descriptions.
Enjoy :)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Ranked data is commonly used in research across many fields of study including medicine, biology, psychology, and economics. One common statistic used for analyzing ranked data is Kendall’s τ coefficient, a nonparametric measure of rank correlation which describes the strength of the association between two monotonic continuous or ordinal variables. While the mathematics involved in calculating Kendall’s τ is well-established, there are relatively few graphing methods available to visualize the results. Here, we describe several alternative and complementary visualization methods and provide an interactive app for graphing Kendall’s τ. The resulting graphs provide a visualization of rank correlation which helps display the proportion of concordant and discordant pairs. Moreover, these methods highlight other key features of the data which are not represented by Kendall’s τ alone but may nevertheless be meaningful, such as longer monotonic chains and the relationship between discrete pairs of observations. We demonstrate the utility of these approaches through several examples and compare our results to other visualization methods.
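As a small worked illustration, Kendall's τ without ties is τ = (C − D) / (n(n − 1)/2), where C and D count concordant and discordant pairs; the sketch below computes the counts by hand and checks them against scipy.stats.kendalltau.

```python
# Sketch: counting concordant/discordant pairs and comparing the hand
# computation of Kendall's tau with scipy.stats.kendalltau (no ties here).
from itertools import combinations
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

concordant = discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    s = (x1 - x2) * (y1 - y2)
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

n_pairs = len(x) * (len(x) - 1) // 2       # 15 pairs for n = 6
print((concordant - discordant) / n_pairs) # (12 - 3) / 15 = 0.6
tau, p = kendalltau(x, y)
print(tau, p)                              # matches: tau-b equals tau-a without ties
```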