100+ datasets found
  1. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Explore at:
    zip(2028853 bytes)Available download formats
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset: - CID (Customer ID): A unique identifier for each customer. - TID (Transaction ID): A unique identifier for each transaction. - Gender: The gender of the customer, categorized as Male or Female. - Age Group: Age group of the customer, divided into several ranges. - Purchase Date: The timestamp of when the transaction took place. - Product Category: The category of the product purchased, such as Electronics, Apparel, etc. - Discount Availed: Indicates whether the customer availed any discount (Yes/No). - Discount Name: Name of the discount applied (e.g., FESTIVE50). - Discount Amount (INR): The amount of discount availed by the customer. - Gross Amount: The total amount before applying any discount. - Net Amount: The final amount after applying the discount. - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.). - Location: The city where the purchase took place.

    Use Cases: 1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data. 2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis. 3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts. 4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning

  2. Exploratory Data Analysis | EDA - use case

    • kaggle.com
    zip
    Updated Feb 11, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mohinur Abdurahimova (2022). Exploratory Data Analysis | EDA - use case [Dataset]. https://www.kaggle.com/datasets/mohinurabdurahimova/exploratory-data-analysis-eda-use-case
    Explore at:
    zip(49796 bytes)Available download formats
    Dataset updated
    Feb 11, 2022
    Authors
    Mohinur Abdurahimova
    Description

    Dataset

    This dataset was created by Mohinur Abdurahimova

    Released under Data files © Original Authors

    Contents

  3. Orange dataset table

    • figshare.com
    xlsx
    Updated Mar 4, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Mar 4, 2022
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    Rui Simões
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, it does not contain any missing values and data was standardized across features. The small number of samples prevented a full and strong statistical analysis of the results. Nevertheless, it allowed the identification of relevant hidden patterns and trends.

    Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using the Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments were performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.

  4. u

    Data from: Supplementary Material for "Sonification for Exploratory Data...

    • pub.uni-bielefeld.de
    • search.datacite.org
    Updated Feb 5, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Thomas Hermann (2019). Supplementary Material for "Sonification for Exploratory Data Analysis" [Dataset]. https://pub.uni-bielefeld.de/record/2920448
    Explore at:
    Dataset updated
    Feb 5, 2019
    Authors
    Thomas Hermann
    License

    Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Sonification for Exploratory Data Analysis

    Chapter 8: Sonification Models

    In Chapter 8 of the thesis, 6 sonification models are presented to give some examples for the framework of Model-Based Sonification, developed in Chapter 7. Sonification models determine the rendering of the sonification and possible interactions. The "model in mind" helps the user to interprete the sound with respect to the data.

    8.1 Data Sonograms

    Data Sonograms use spherical expanding shock waves to excite linear oscillators which are represented by point masses in model space.

    • Table 8.2, page 87: Sound examples for Data Sonograms
    File:
    Iris dataset: started in plot "https://pub.uni-bielefeld.de/download/2920448/2920454">(a) at S0 (b) at S1 (c) at S2
    10d noisy circle dataset: started in plot (c) at "https://pub.uni-bielefeld.de/download/2920448/2920451">S0 (mean) (d) at S1 (edge)
    10d Gaussian: plot (d) started at S0
    3 clusters: Example 1
    3 clusters: invisible columns used as output variables: "https://pub.uni-bielefeld.de/download/2920448/2920450">Example 2
    Description:
    Data Sonogram Sound examples for synthetic datasets and the Iris dataset
    Duration:
    about 5 s
    8.2 Particle Trajectory Sonification Model

    This sonification model explores features of a data distribution by computing the trajectories of test particles which are injected into model space and move according to Newton's laws of motion in a potential given by the dataset.

    • Sound example: page 93, PTSM-Ex-1 Audification of 1 particle in the potential of phi(x).
    • Sound example: page 93, PTSM-Ex-2 Audification of a sequence of 15 particles in the potential of a dataset with 2 clusters.
    • Sound example: page 94, PTSM-Ex-3 Audification of 25 particles simultaneous in a potential of a dataset with 2 clusters.
    • Sound example: page 94, PTSM-Ex-4 Audification of 25 particles simultaneous in a potential of a dataset with 1 cluster.
    • Sound example: page 95, PTSM-Ex-5 sigma-step sequence for a mixture of three Gaussian clusters
    • Sound example: page 95, PTSM-Ex-6 sigma-step sequence for a Gaussian cluster
    • Sound example: page 96, PTSM-Iris-1 Sonification for the Iris Dataset with 20 particles per step.
    • Sound example: page 96, PTSM-Iris-2 Sonification for the Iris Dataset with 3 particles per step.
    • Sound example: page 96, PTSM-Tetra-1 Sonification for a 4d tetrahedron clusters dataset.
    8.3 Markov chain Monte Carlo Sonification

    The McMC Sonification Model defines a exploratory process in the domain of a given density p such that the acoustic representation summarizes features of p, particularly concerning the modes of p by sound.

    • Sound Example: page 105, MCMC-Ex-1 McMC Sonification, stabilization of amplitudes.
    • Sound Example: page 106, MCMC-Ex-2 Trajectory Audification for 100 McMC steps in 3 cluster dataset
    • McMC Sonification for Cluster Analysis, dataset with three clusters, page 107
    • McMC Sonification for Cluster
  5. Data from: Penguins Go Parallel: A Grammar of Graphics Framework for...

    • tandf.figshare.com
    txt
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Susan VanderPlas; Yawei Ge; Antony Unwin; Heike Hofmann (2023). Penguins Go Parallel: A Grammar of Graphics Framework for Generalized Parallel Coordinate Plots [Dataset]. http://doi.org/10.6084/m9.figshare.22467369.v2
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Taylor & Francishttps://taylorandfrancis.com/
    Authors
    Susan VanderPlas; Yawei Ge; Antony Unwin; Heike Hofmann
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Parallel Coordinate Plots (PCP) are a valuable tool for exploratory data analysis of high-dimensional numerical data. The use of PCPs is limited when working with categorical variables or a mix of categorical and continuous variables. In this article, we propose Generalized Parallel Coordinate Plots (GPCP) to extend the ability of PCPs from just numeric variables to dealing seamlessly with a mix of categorical and numeric variables in a single plot. In this process we find that existing solutions for categorical values only, such as hammock plots or parsets become edge cases in the new framework. By focusing on individual observations rather than a marginal frequency we gain additional flexibility. The resulting approach is implemented in the R package ggpcp. Supplementary materials for this article are available online.

  6. Calculus Video Worked Example Data

    • kaggle.com
    zip
    Updated Aug 15, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jocelyn Dumlao (2023). Calculus Video Worked Example Data [Dataset]. https://www.kaggle.com/datasets/jocelyndumlao/calculus-video-worked-example-data
    Explore at:
    zip(2757 bytes)Available download formats
    Dataset updated
    Aug 15, 2023
    Authors
    Jocelyn Dumlao
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Description:

    Summary data from a Calculus II class where students were required to watch an instructional video before or after the lecture. The dataset includes gender (1=female; 2=male), vgroup (-1=before lecture; 1=after lecture), binary flag for 26 individual videos (1=watched 80% or more of length of video; 0=not watched), videosum (sum of number of videos watched), final_raw (raw grade student received on cumulative final course exam), sat_math (scaled SAT-Math score out of 800), math_place (institutional calculus readiness score out of 100), watched20 (grouping flag for students who watched 20 or more videos).

    Categories:

    Mathematics Education

    Acknowledgements:

    DeFranco, Thomas; Judd, Jamison

  7. f

    Data from: Multivariate Outliers and the O3 Plot

    • figshare.com
    • tandf.figshare.com
    txt
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Antony Unwin (2023). Multivariate Outliers and the O3 Plot [Dataset]. http://doi.org/10.6084/m9.figshare.7792115.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Antony Unwin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identifying and dealing with outliers is an important part of data analysis. A new visualization, the O3 plot, is introduced to aid in the display and understanding of patterns of multivariate outliers. It uses the results of identifying outliers for every possible combination of dataset variables to provide insight into why particular cases are outliers. The O3 plot can be used to compare the results from up to six different outlier identification methods. There is anRpackage OutliersO3 implementing the plot. The article is illustrated with outlier analyses of German demographic and economic data. Supplementary materials for this article are available online.

  8. r

    Exploratory data analysis of infrared spectra from 3D-printing polymers

    • researchdata.edu.au
    Updated Oct 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Simon Lewis; Michael V. Adamos; Kari Pitts; Georgina Sauzier (2025). Exploratory data analysis of infrared spectra from 3D-printing polymers [Dataset]. http://doi.org/10.25917/FN6A-AZ80
    Explore at:
    Dataset updated
    Oct 31, 2025
    Dataset provided by
    Curtin University
    Authors
    Simon Lewis; Michael V. Adamos; Kari Pitts; Georgina Sauzier
    Description

    Data description: This dataset consists of spectroscopic data files and associated R-scripts for exploratory data analysis. Attenuated total reflectance Fourier transform infrared (ATR-FTIR) spectra were collected from 67 samples of polymer filaments potentially used to produce illicit 3D-printed items. Principal component analysis (PCA) was used to determine if any individual filaments gave distinctive spectral signatures, potentially allowing traceability of 3D-printed items for forensic purposes. The project also investigated potential chemical variations induced by the filament manufacturing or 3D-printing process. Data was collected and analysed by Michael Adamos at Curtin University (Perth, Western Australia), under the supervision of Dr Georgina Sauzier and Prof. Simon Lewis and with specialist input from Dr Kari Pitts.

    Data collection time details: 2024
    Number of files/types: 3 .R files, 702 .JDX files
    Geographic information (if relevant): Australia
    Keywords: 3D printing, polymers, infrared spectroscopy, forensic science

  9. f

    Data from: The Often-Overlooked Power of Summary Statistics in Exploratory...

    • acs.figshare.com
    xlsx
    Updated Jun 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford (2023). The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE) [Dataset]. http://doi.org/10.1021/acs.jcim.1c00244.s002
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing datathey are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.

  10. d

    Data from: Research and exploratory analysis driven - time-data...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jan 30, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko (2022). Research and exploratory analysis driven - time-data visualization (read-tv) software [Dataset]. http://doi.org/10.5061/dryad.d51c5b02g
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 30, 2022
    Dataset provided by
    Dryad
    Authors
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko
    Time period covered
    Jan 25, 2021
    Description

    This section does not describe the methods of read-tv software development, which can be found in the associated manuscript from JAMIA Open (JAMIO-2020-0121.R1). This section describes the methods involved in the surgical work flow disruption data collection. A curated, PHI-free (protected health information) version of this dataset was used as a use case for this manuscript.

    Observer training

    Trained human factors researchers conducted each observation following the completion of observer training. The researchers were two full-time research assistants based in the department of surgery at site 3 who visited the other two sites to collect data. Human Factors experts guided and trained each observer in the identification and standardized collection of FDs. The observers were also trained in the basic components of robotic surgery in order to be able to tangibly isolate and describe such disruptive events.

    Comprehensive observer training was ensured with both classroom and floor train...

  11. f

    Data from: ftmsRanalysis: An R package for exploratory data analysis and...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Mar 16, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bramer, Lisa M.; Claborne, Daniel; Stratton, Kelly G.; Hofmockel, Kirsten; Thompson, Allison M.; McCue, Lee Ann; White, Amanda M. (2020). ftmsRanalysis: An R package for exploratory data analysis and interactive visualization of FT-MS data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000452686
    Explore at:
    Dataset updated
    Mar 16, 2020
    Authors
    Bramer, Lisa M.; Claborne, Daniel; Stratton, Kelly G.; Hofmockel, Kirsten; Thompson, Allison M.; McCue, Lee Ann; White, Amanda M.
    Description

    The high-resolution and mass accuracy of Fourier transform mass spectrometry (FT-MS) has made it an increasingly popular technique for discerning the composition of soil, plant and aquatic samples containing complex mixtures of proteins, carbohydrates, lipids, lignins, hydrocarbons, phytochemicals and other compounds. Thus, there is a growing demand for informatics tools to analyze FT-MS data that will aid investigators seeking to understand the availability of carbon compounds to biotic and abiotic oxidation and to compare fundamental chemical properties of complex samples across groups. We present ftmsRanalysis, an R package which provides an extensive collection of data formatting and processing, filtering, visualization, and sample and group comparison functionalities. The package provides a suite of plotting methods and enables expedient, flexible and interactive visualization of complex datasets through functions which link to a powerful and interactive visualization user interface, Trelliscope. Example analysis using FT-MS data from a soil microbiology study demonstrates the core functionality of the package and highlights the capabilities for producing interactive visualizations.

  12. o

    Whistlerlib: a distributed computing library for exploratory data analysis...

    • repositorio.observatoriogeo.mx
    Updated Oct 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Whistlerlib: a distributed computing library for exploratory data analysis on large social network datasets - Dataset - Repositorio del Observatorio Metropolitano CentroGeo [Dataset]. http://repositorio.observatoriogeo.mx/dataset/1ee805b50082
    Explore at:
    Dataset updated
    Oct 21, 2025
    Description

    At least 350k posts are published on X, 510k comments are posted on Facebook, and 66k pictures and videos are shared on Instagram each minute. These large datasets require substantial processing power, even if only a percentage is collected for analysis and research. To face this challenge, data scientists can now use computer clusters deployed on various IaaS and PaaS services in the cloud. However, scientists still have to master the design of distributed algorithms and be familiar with using distributed computing programming frameworks. It is thus essential to generate tools that provide analysis methods to leverage the advantages of computer clusters for processing large amounts of social network text. This paper presents Whistlerlib, a new Python library for conducting exploratory analysis on large text datasets on social networks. Whistlerlib implements distributed versions of various social media, sentiment, and social network analysis methods that can run atop computer clusters. We experimentally demonstrate the scalability of the various Whistlerlib distributed methods when deployed on a public cloud platform. We also present a practical example of the analysis of posts on the social network X about the Mexico City subway to showcase the features of Whistlerlib in scenarios where social network analysis tools are needed to address issues with a social dimension.

  13. 2010 Census: Iowa Population by ZCTA

    • kaggle.com
    zip
    Updated May 2, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mark Mucchetti (2020). 2010 Census: Iowa Population by ZCTA [Dataset]. https://www.kaggle.com/markmucchetti/2010-census-iowa-population-by-zcta
    Explore at:
    zip(8057 bytes)Available download formats
    Dataset updated
    May 2, 2020
    Authors
    Mark Mucchetti
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Iowa
    Description

    Context

    This is a sample dataset derived from 2010 U.S. Government Census data. It is intended to be used in combination with example analyses on the public dataset "Iowa Liquor Sales", available as a Google Public Dataset, on Kaggle, and at https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy.

    Usage

    This dataset is intended for use as an example. Columns have purposely not been filtered by string manipulation in order to explore joining data between two pandas DataFrames and to do further processing.

    Because this data is at the Zip Code Tabulation Area (ZCTA) level, additional processing is required to join it with general-purpose datasets, which may be specified at the zip code, county name, county FIPS code, or coordinate level. This is intentional.

  14. Data from: Visualization for Interval Data

    • tandf.figshare.com
    txt
    Updated Jun 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Muzi Zhang; Dennis K. J. Lin (2023). Visualization for Interval Data [Dataset]. http://doi.org/10.6084/m9.figshare.19617396.v2
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francishttps://taylorandfrancis.com/
    Authors
    Muzi Zhang; Dennis K. J. Lin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Interval data are widely used in many fields, notably in economics, industry, and health areas. Analogous to the scatterplot for single-value data, the rectangle plot and cross plot are the conventional visualization methods for the relationship between two variables in interval forms. These methods do not provide much information to assess complicated relationships, however. In this article, we propose two visualization methods: Segment and Dandelion plots. They offer much more information than the existing visualization methods and allow us to have a much better understanding of the relationship between two variables in interval forms. A general guide for reading these plots is provided. Relevant theoretical support is developed. Both empirical and real data examples are provided to demonstrate the advantages of the proposed visualization methods. Supplementary materials for this article are available online.

  15. f

    Data from: MAIN MINERALS AND ORGANIC COMPOUNDS IN COMMERCIAL ROASTED AND...

    • datasetcatalog.nlm.nih.gov
    • scielo.figshare.com
    Updated Mar 24, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Flores, Eder Lisandro Moraes; Leite, Oldair Donizete; Canan, Cristiane; Kalschne, Daneysa Lahis; de Toledo Benassi, Marta; Silva, Nathalia Karen (2021). MAIN MINERALS AND ORGANIC COMPOUNDS IN COMMERCIAL ROASTED AND GROUND COFFEE: AN EXPLORATORY DATA ANALYSIS [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000870684
    Explore at:
    Dataset updated
    Mar 24, 2021
    Authors
    Flores, Eder Lisandro Moraes; Leite, Oldair Donizete; Canan, Cristiane; Kalschne, Daneysa Lahis; de Toledo Benassi, Marta; Silva, Nathalia Karen
    Description

    Coffee is one of the most popular beverages in the world, however, little information is found regarding the mineral composition of commercial roasted and ground coffees (RG) and its correlation with organic bioactive compounds. 21 commercial Brazilian RG coffee brands - 9 traditional (T) and 12 extra strong (ES) roasted ones - were analyzed for the Cu, Ca, Mn, Mg, K, Zn, and Fe minerals, caffeine, 5-caffeoylquinic acid (5-CQA) and melanoidins contents. For minerals determination by flame atomic absorption spectrometry (FAAS), the samples were decomposed by microwave-assisted wet digestion. Caffeine and 5-CQA were determined by liquid chromatography and melanoidins by molecular absorption spectrometry. The minerals and organic compounds contents association in RG coffee was observed by a principal component analysis. The thermostable compounds (minerals and caffeine) were related to dimension 1 and 2, while 5-CQA and melanoidins were related to dimension 3, allowing for the T coffees segmentation from ES ones.

  16. Learn CV

    • kaggle.com
    zip
    Updated May 11, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sani Kamal (2022). Learn CV [Dataset]. https://www.kaggle.com/sanikamal/learn-cv
    Explore at:
    zip(46808328 bytes)Available download formats
    Dataset updated
    May 11, 2022
    Authors
    Sani Kamal
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Sani Kamal

    Released under CC0: Public Domain

    Contents

  17. A data analysis framework for biomedical big data: Application on mesoderm...

    • plos.figshare.com
    txt
    Updated Jun 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Benjamin Ulfenborg; Alexander Karlsson; Maria Riveiro; Caroline Améen; Karolina Åkesson; Christian X. Andersson; Peter Sartipy; Jane Synnergren (2023). A data analysis framework for biomedical big data: Application on mesoderm differentiation of human pluripotent stem cells [Dataset]. http://doi.org/10.1371/journal.pone.0179613
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Benjamin Ulfenborg; Alexander Karlsson; Maria Riveiro; Caroline Améen; Karolina Åkesson; Christian X. Andersson; Peter Sartipy; Jane Synnergren
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The development of high-throughput biomolecular technologies has resulted in generation of vast omics data at an unprecedented rate. This is transforming biomedical research into a big data discipline, where the main challenges relate to the analysis and interpretation of data into new biological knowledge. The aim of this study was to develop a framework for biomedical big data analytics, and apply it for analyzing transcriptomics time series data from early differentiation of human pluripotent stem cells towards the mesoderm and cardiac lineages. To this end, transcriptome profiling by microarray was performed on differentiating human pluripotent stem cells sampled at eleven consecutive days. The gene expression data was analyzed using the five-stage analysis framework proposed in this study, including data preparation, exploratory data analysis, confirmatory analysis, biological knowledge discovery, and visualization of the results. Clustering analysis revealed several distinct expression profiles during differentiation. Genes with an early transient response were strongly related to embryonic- and mesendoderm development, for example CER1 and NODAL. Pluripotency genes, such as NANOG and SOX2, exhibited substantial downregulation shortly after onset of differentiation. Rapid induction of genes related to metal ion response, cardiac tissue development, and muscle contraction were observed around day five and six. Several transcription factors were identified as potential regulators of these processes, e.g. POU1F1, TCF4 and TBP for muscle contraction genes. Pathway analysis revealed temporal activity of several signaling pathways, for example the inhibition of WNT signaling on day 2 and its reactivation on day 4. This study provides a comprehensive characterization of biological events and key regulators of the early differentiation of human pluripotent stem cells towards the mesoderm and cardiac lineages. The proposed analysis framework can be used to structure data analysis in future research, both in stem cell differentiation, and more generally, in biomedical big data analytics.

  18. Breast Cancer Exploratory Data Analysis EDA

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dr. Nagendra (2025). Breast Cancer Exploratory Data Analysis EDA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/breast-cancer-exploratory-data-analysis-eda
    Explore at:
    zip(7609364 bytes)Available download formats
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains clinical and diagnostic features related to Breast Cancer, designed for comprehensive Exploratory Data Analysis (EDA) and subsequent predictive modeling.

    It is derived from digitized images of Fine Needle Aspirates (FNA) of breast masses.

    The dataset features quantitative measurements, typically calculated from the characteristics of cell nuclei, including: - Radius - Texture - Perimeter - Area - Smoothness - Compactness - Concavity - Concave Points - Symmetry - Fractal Dimension

    These features are provided as mean, standard error, and "worst" (largest) values.

    The primary goal of this resource is to support the validation of EDA techniques necessary for clinical data science: - Data quality assessment (missing values, inconsistencies). - Feature assessment (distributions, correlations). - Visualization for diagnostic modeling.

    The primary target variable is the binary classification of the tissue sample: Malignant vs. Benign.

  19. Streaming Service Data

    • kaggle.com
    Updated Dec 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chad Wambles (2024). Streaming Service Data [Dataset]. https://www.kaggle.com/datasets/chadwambles/streaming-service-data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Chad Wambles
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A dataset I generated to showcase a sample set of user data for a fictional streaming service. This data is great for practicing SQL, Excel, Tableau, or Power BI.

    1000 rows and 25 columns of connected data.

    See below for column descriptions.

    Enjoy :)

  20. f

    Data from: Visualizing Kendall’s τ and Hidden Structures in Ranked Data

    • figshare.com
    docx
    Updated Nov 12, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nicholas D. Edwards; Enzo de Jong; Feng Liu; Stephen T. Ferguson (2025). Visualizing Kendall’s τ and Hidden Structures in Ranked Data [Dataset]. http://doi.org/10.6084/m9.figshare.30188503.v1
    Explore at:
    docxAvailable download formats
    Dataset updated
    Nov 12, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Nicholas D. Edwards; Enzo de Jong; Feng Liu; Stephen T. Ferguson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ranked data is commonly used in research across many fields of study including medicine, biology, psychology, and economics. One common statistic used for analyzing ranked data is Kendall’s τ coefficient, a nonparametric measure of rank correlation which describes the strength of the association between two monotonic continuous or ordinal variables. While the mathematics involved in calculating Kendall’s τ is well-established, there are relatively few graphing methods available to visualize the results. Here, we describe several alternative and complementary visualization methods and provide an interactive app for graphing Kendall’s τ. The resulting graphs provide a visualization of rank correlation which helps display the proportion of concordant and discordant pairs. Moreover, these methods highlight other key features of the data which are not represented by Kendall’s τ alone but may nevertheless be meaningful, such as longer monotonic chains and the relationship between discrete pairs of observations. We demonstrate the utility of these approaches through several examples and compare our results to other visualization methods.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
Organization logo

Ecommerce Dataset for Data Analysis

Exploratory Data Analysis, Data Visualisation and Machine Learning

Explore at:
zip(2028853 bytes)Available download formats
Dataset updated
Sep 19, 2024
Authors
Shrishti Manja
Description

This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

About the Dataset: - CID (Customer ID): A unique identifier for each customer. - TID (Transaction ID): A unique identifier for each transaction. - Gender: The gender of the customer, categorized as Male or Female. - Age Group: Age group of the customer, divided into several ranges. - Purchase Date: The timestamp of when the transaction took place. - Product Category: The category of the product purchased, such as Electronics, Apparel, etc. - Discount Availed: Indicates whether the customer availed any discount (Yes/No). - Discount Name: Name of the discount applied (e.g., FESTIVE50). - Discount Amount (INR): The amount of discount availed by the customer. - Gross Amount: The total amount before applying any discount. - Net Amount: The final amount after applying the discount. - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.). - Location: The city where the purchase took place.

Use Cases: 1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data. 2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis. 3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts. 4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning

Search
Clear search
Close search
Google apps
Main menu