100+ datasets found
  1. DataSheet1_Exploratory data analysis (EDA) machine learning approaches for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001
    Explore at:
    docx (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry (IRMS) data from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean world missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns and significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower-dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared the dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration.
Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
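    The EDA steps described above (dimensionality reduction followed by clustering in the reduced space) can be sketched with scikit-learn; this is a minimal illustration on synthetic data, not the authors' IRMS pipeline, and it uses PCA plus k-means only (the paper also evaluates UMAP):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for high-dimensional spectral features:
# two experimental conditions, 25 samples each, 40 variables.
X = np.vstack([rng.normal(0.0, 1.0, (25, 40)),
               rng.normal(2.0, 1.0, (25, 40))])

# Standardize, then project to a lower-dimensional space with PCA.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Cluster in the reduced space; in the study, clusters would then be
# mapped to conditions such as seawater composition or CO2 concentration.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)
print(X_2d.shape, np.unique(labels))
```

    A data-driven grouping recovered this way is only the starting point; the study's contribution is mapping such clusters back to known experimental conditions.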

  2. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Washington and Lee University
    College of William and Mary
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support the arguments that shape the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing consisted of casting the features to the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract information left missing by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of each paper by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis, where we performed a Principal Component Analysis to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models.

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (Correlations and Association Rules) and categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A full explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by a reasoning process over the data mining outcomes, which produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles represent both premises and conclusions. An arrow connecting a premise to a conclusion means the conclusion is associated with that premise. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain support and confidence.

    Support = (number of occurrences where the statement is true) / (total number of statements)
    Confidence = (support of the statement) / (number of occurrences of the premise)
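    The two measures can be computed directly from co-occurrence counts; a minimal sketch, where the transactions are invented for illustration and do not reproduce the study's extraction:

```python
# Each "transaction" is the feature set extracted from one paper.
# These records are invented for illustration, not the DL4SE data.
papers = [
    {"supervised", "irreproducible"},
    {"supervised", "irreproducible"},
    {"supervised", "reproducible"},
    {"unsupervised", "reproducible"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(premise, conclusion, transactions):
    # support(premise AND conclusion) / support(premise)
    return support(premise | conclusion, transactions) / support(premise, transactions)

print(support({"supervised", "irreproducible"}, papers))   # 0.5
print(confidence({"supervised"}, {"irreproducible"}, papers))
```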

  3. Exploratory data analysis of a clinical study group: Development of a...

    • plos.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Bogumil M. Konopka; Felicja Lwow; Magdalena Owczarz; Łukasz Łaczmański (2023). Exploratory data analysis of a clinical study group: Development of a procedure for exploring multidimensional data [Dataset]. http://doi.org/10.1371/journal.pone.0201950
    Explore at:
    txt (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Bogumil M. Konopka; Felicja Lwow; Magdalena Owczarz; Łukasz Łaczmański
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Thorough knowledge of the structure of analyzed data makes it possible to form detailed scientific hypotheses and research questions. The structure of data can be revealed with methods for exploratory data analysis. Due to the multitude of available methods, selecting those which will work together well and facilitate data interpretation is not an easy task. In this work we present a well-fitted set of tools for a complete exploratory analysis of a clinical dataset and perform a case-study analysis on a set of 515 patients. The proposed procedure comprises several steps: 1) robust data normalization, 2) outlier detection with Mahalanobis (MD) and robust Mahalanobis distances (rMD), 3) hierarchical clustering with Ward’s algorithm, 4) Principal Component Analysis with biplot vectors. The analyzed set comprised elderly patients that participated in the PolSenior project. Each patient was characterized by over 40 biochemical and socio-geographical attributes. Introductory analysis showed that the case-study dataset comprises two clusters separated along the axis of sex hormone attributes. Further analysis was carried out separately for male and female patients. The most optimal partitioning in the male set resulted in five subgroups. Two of them were related to diseased patients: 1) diabetes and 2) hypogonadism patients. Analysis of the female set suggested that it was more homogeneous than the male dataset. No evidence of pathological patient subgroups was found. In the study we showed that outlier detection with MD and rMD not only identifies outliers, but can also assess the heterogeneity of a dataset. The case study proved that our procedure is well suited for identification and visualization of biologically meaningful patient subgroups.
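    Step 2 of the procedure (outlier detection via Mahalanobis distance) can be sketched as follows; a minimal illustration on synthetic data using the classical estimator (the paper additionally uses a robust variant, rMD):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
# Synthetic stand-in for patient records: 200 samples, 5 attributes,
# with three planted outliers far from the bulk of the data.
X = rng.normal(0, 1, (200, 5))
X[:3] += 10.0

# Squared Mahalanobis distance of each sample to the sample centroid.
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
md2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Under multivariate normality, md2 follows a chi-squared distribution
# with p degrees of freedom; flag samples beyond the 99.9% quantile.
threshold = chi2.ppf(0.999, df=X.shape[1])
outliers = np.where(md2 > threshold)[0]
print(outliers)
```

    The robust variant replaces the sample mean and covariance with estimates that are themselves insensitive to the outliers being hunted, which avoids masking when contamination is heavy.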

  4. Titanic- exploratory data analysis

    • kaggle.com
    zip
    Updated Jul 19, 2025
    Cite
    Karthik (2025). Titanic- exploratory data analysis [Dataset]. https://www.kaggle.com/datasets/pandureddy123/titanic-exploratory-data-analysis
    Explore at:
    zip, 962897 bytes (available download formats)
    Dataset updated
    Jul 19, 2025
    Authors
    Karthik
    Description

    One more step towards Machine learning! This is the Titanic dataset together with an exploratory data analysis HTML file. I used pandas-profiling for fast analysis.

  5. SEM regression for H1-5.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). SEM regression for H1-5. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t004
    Explore at:
    xls (available download formats)
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  6. Telemarketing Case

    • kaggle.com
    zip
    Updated Dec 3, 2018
    Cite
    Prabhakar (2018). Telemarketing Case [Dataset]. https://www.kaggle.com/datasets/prabhakar01/telemarketing-case
    Explore at:
    zip, 1684198 bytes (available download formats)
    Dataset updated
    Dec 3, 2018
    Authors
    Prabhakar
    Description

    Dataset

    This dataset was created by Prabhakar

    Released under Other (specified in description)


  7. Data from: Supplementary Material for "Sonification for Exploratory Data...

    • pub.uni-bielefeld.de
    • search.datacite.org
    Updated Feb 5, 2019
    Cite
    Thomas Hermann (2019). Supplementary Material for "Sonification for Exploratory Data Analysis" [Dataset]. https://pub.uni-bielefeld.de/record/2920448
    Explore at:
    Dataset updated
    Feb 5, 2019
    Authors
    Thomas Hermann
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Sonification for Exploratory Data Analysis

    Chapter 8: Sonification Models

    In Chapter 8 of the thesis, 6 sonification models are presented to give some examples for the framework of Model-Based Sonification developed in Chapter 7. Sonification models determine the rendering of the sonification and the possible interactions. The "model in mind" helps the user to interpret the sound with respect to the data.

    8.1 Data Sonograms

    Data Sonograms use spherical expanding shock waves to excite linear oscillators which are represented by point masses in model space.

    • Table 8.2, page 87: Sound examples for Data Sonograms
    File:
    Iris dataset: started in plot (a) at S0, (b) at S1, (c) at S2 (https://pub.uni-bielefeld.de/download/2920448/2920454)
    10d noisy circle dataset: started in plot (c) at S0 (mean), (d) at S1 (edge) (https://pub.uni-bielefeld.de/download/2920448/2920451)
    10d Gaussian: plot (d) started at S0
    3 clusters: Example 1
    3 clusters: invisible columns used as output variables: Example 2 (https://pub.uni-bielefeld.de/download/2920448/2920450)
    Description:
    Data Sonogram Sound examples for synthetic datasets and the Iris dataset
    Duration:
    about 5 s
    8.2 Particle Trajectory Sonification Model

    This sonification model explores features of a data distribution by computing the trajectories of test particles which are injected into model space and move according to Newton's laws of motion in a potential given by the dataset.

    • Sound example: page 93, PTSM-Ex-1 Audification of 1 particle in the potential of phi(x).
    • Sound example: page 93, PTSM-Ex-2 Audification of a sequence of 15 particles in the potential of a dataset with 2 clusters.
    • Sound example: page 94, PTSM-Ex-3 Audification of 25 particles simultaneously in a potential of a dataset with 2 clusters.
    • Sound example: page 94, PTSM-Ex-4 Audification of 25 particles simultaneously in a potential of a dataset with 1 cluster.
    • Sound example: page 95, PTSM-Ex-5 sigma-step sequence for a mixture of three Gaussian clusters
    • Sound example: page 95, PTSM-Ex-6 sigma-step sequence for a Gaussian cluster
    • Sound example: page 96, PTSM-Iris-1 Sonification for the Iris Dataset with 20 particles per step.
    • Sound example: page 96, PTSM-Iris-2 Sonification for the Iris Dataset with 3 particles per step.
    • Sound example: page 96, PTSM-Tetra-1 Sonification for a 4d tetrahedron clusters dataset.
    8.3 Markov chain Monte Carlo Sonification

    The McMC Sonification Model defines an exploratory process in the domain of a given density p such that the acoustic representation summarizes features of p, particularly the modes of p.

    • Sound Example: page 105, MCMC-Ex-1 McMC Sonification, stabilization of amplitudes.
    • Sound Example: page 106, MCMC-Ex-2 Trajectory Audification for 100 McMC steps in 3 cluster dataset
    • McMC Sonification for Cluster Analysis, dataset with three clusters, page 107
    • McMC Sonification for Cluster

  8. Theatres_Data_Version_1.0

    • kaggle.com
    zip
    Updated Aug 4, 2025
    Cite
    deanesh takkallapati (2025). Theatres_Data_Version_1.0 [Dataset]. https://www.kaggle.com/datasets/deaneshtakkallapati/theatres-data-version-1-0/data
    Explore at:
    zip, 30479 bytes (available download formats)
    Dataset updated
    Aug 4, 2025
    Authors
    deanesh takkallapati
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Exploratory Data Analysis (EDA) of Theatres Data in India

    Steps
      1. Load Data
      2. Check Nulls and Update Data if required
      3. Perform Descriptive Statistics
      4. Data Visualization
         Univariate - Single Column Visualization
           categorical - countplot
           continuous - histogram
         Bivariate - 2 Columns Visualization
          continuous vs continuous  - scatterplot, regplot
          categorical vs continuous  - boxplot
          categorical vs categorical - crosstab, heatmap
         Multivariate - Multi Columns Visualization
          correlation plot
          pairplot
    
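    The steps above can be sketched with pandas; a minimal illustration in which the table and its column names are invented, not taken from the dataset (the plots themselves would use seaborn's countplot, boxplot, heatmap, etc.):

```python
import pandas as pd

# Hypothetical theatre records; the column names are invented for illustration.
df = pd.DataFrame({
    "state":    ["TS", "TS", "AP", "AP", "KA", None],
    "screens":  [4, 2, 6, 3, 5, 2],
    "capacity": [900, 400, 1500, 700, 1100, 350],
})

# Step 2: check nulls, then update the data (here: drop incomplete rows).
print(df.isna().sum())
df = df.dropna()

# Step 3: descriptive statistics for the continuous columns.
print(df[["screens", "capacity"]].describe())

# Step 4, categorical vs categorical: a crosstab (heatmap input), e.g.
# state against whether a theatre has more than three screens.
print(pd.crosstab(df["state"], df["screens"] > 3))
```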
  9. Additional file 1 of Simple but powerful interactive data analysis in R with...

    • springernature.figshare.com
    zip
    Updated Nov 26, 2024
    Cite
    Svetlana Ovchinnikova; Simon Anders (2024). Additional file 1 of Simple but powerful interactive data analysis in R with R/LinkedCharts [Dataset]. http://doi.org/10.6084/m9.figshare.26677037.v2
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Svetlana Ovchinnikova; Simon Anders
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Zip file containing the interactive supplement.

  10. Exploratory Data Analysis: Crop Classification & Forecasting - EDA

    • ai.tracebloc.io
    json
    Updated Nov 17, 2025
    Cite
    tracebloc (2025). Exploratory Data Analysis: Crop Classification & Forecasting - EDA [Dataset]. https://ai.tracebloc.io/explore/ai-satellite-crop-classification-and-yield-forecasting?tab=exploratory-data-analysis
    Explore at:
    json (available download formats)
    Dataset updated
    Nov 17, 2025
    Dataset provided by
    Tracebloc GmbH
    Authors
    tracebloc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Missing Values
    Measurement technique
    Statistical and exploratory data analysis
    Description

    Understand the satellite dataset before training. Explore distributions, preprocessing steps, and key insights to evaluate 3rd-party models effectively.

  11. The five-step co-duction cycle.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). The five-step co-duction cycle. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  12. Groups of words for our Z and X variables.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). Groups of words for our Z and X variables. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  13. Daily Machine Learning Practice

    • kaggle.com
    zip
    Updated Nov 9, 2025
    Cite
    Astrid Villalobos (2025). Daily Machine Learning Practice [Dataset]. https://www.kaggle.com/datasets/astridvillalobos/daily-machine-learning-practice
    Explore at:
    zip, 1019861 bytes (available download formats)
    Dataset updated
    Nov 9, 2025
    Authors
    Astrid Villalobos
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Daily Machine Learning Practice – 1 Commit per Day

    Author: Astrid Villalobos
    Location: Montréal, QC
    LinkedIn: https://www.linkedin.com/in/astridcvr/

    Objective The goal of this project is to strengthen Machine Learning and data analysis skills through small, consistent daily contributions. Each commit focuses on a specific aspect of data processing, feature engineering, or modeling using Python, Pandas, and Scikit-learn.

    Dataset
    Source: Kaggle – Sample Sales Data
    File: data/sales_data_sample.csv
    Variables: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, SALES, COUNTRY, etc.
    Goal: Analyze e-commerce performance, predict sales trends, segment customers, and forecast demand.

    Project Rules
    🟩 1 Commit per Day: minimum one line of code daily to ensure consistency and discipline
    🌍 Bilingual Comments: code and documentation in English and French
    📈 Visible Progress: daily green squares = daily learning

    🧰 Tech Stack
    Languages: Python
    Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
    Tools: Jupyter Notebook, GitHub, Kaggle

    Learning Outcomes
    By the end of this challenge:
    Develop a stronger understanding of data preprocessing, modeling, and evaluation.
    Build consistent coding habits through daily practice.
    Apply ML techniques to real-world sales data scenarios.

  14. Exploratory data analysis of infrared spectra from 3D-printing polymers

    • researchdata.edu.au
    Updated Oct 31, 2025
    Cite
    Simon Lewis; Michael V. Adamos; Kari Pitts; Georgina Sauzier (2025). Exploratory data analysis of infrared spectra from 3D-printing polymers [Dataset]. http://doi.org/10.25917/FN6A-AZ80
    Explore at:
    Dataset updated
    Oct 31, 2025
    Dataset provided by
    Curtin University
    Authors
    Simon Lewis; Michael V. Adamos; Kari Pitts; Georgina Sauzier
    Description

    Data description: This dataset consists of spectroscopic data files and associated R-scripts for exploratory data analysis. Attenuated total reflectance Fourier transform infrared (ATR-FTIR) spectra were collected from 67 samples of polymer filaments potentially used to produce illicit 3D-printed items. Principal component analysis (PCA) was used to determine if any individual filaments gave distinctive spectral signatures, potentially allowing traceability of 3D-printed items for forensic purposes. The project also investigated potential chemical variations induced by the filament manufacturing or 3D-printing process. Data was collected and analysed by Michael Adamos at Curtin University (Perth, Western Australia), under the supervision of Dr Georgina Sauzier and Prof. Simon Lewis and with specialist input from Dr Kari Pitts.
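    The PCA step described here can be sketched as follows; a minimal Python illustration on synthetic spectra (the dataset's own analysis scripts are in R), where the band positions and group structure are invented:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic stand-in for ATR-FTIR spectra: two filament types whose
# profiles differ by one extra absorption band (band position invented).
n_channels = 300
base = np.abs(np.sin(np.linspace(0, 6, n_channels)))
band = 0.3 * np.exp(-((np.arange(n_channels) - 120) ** 2) / 50.0)
spectra = np.vstack(
    [base + rng.normal(0, 0.02, n_channels) for _ in range(20)] +
    [base + band + rng.normal(0, 0.02, n_channels) for _ in range(20)]
)

# PCA mean-centers internally; distinctive spectral signatures appear
# as separation of the two groups along the leading components.
scores = PCA(n_components=2).fit_transform(spectra)
print(scores.shape)
```

    If individual filaments carry such signatures, their score-plot separation is what would support forensic traceability.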

    Data collection time details: 2024
    Number of files/types: 3 .R files, 702 .JDX files
    Geographic information (if relevant): Australia
    Keywords: 3D printing, polymers, infrared spectroscopy, forensic science

  15. Wetlands Ecological Integrity Depth To Water Data - Great Sand Dunes...

    • catalog.data.gov
    Updated Nov 25, 2025
    Cite
    National Park Service (2025). Wetlands Ecological Integrity Depth To Water Data - Great Sand Dunes National Park 2009-2019 [Dataset]. https://catalog.data.gov/dataset/wetlands-ecological-integrity-depth-to-water-data-great-sand-dunes-national-park-2009-2019
    Explore at:
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    National Park Service (http://www.nps.gov/)
    Description

    Wetlands Ecological Integrity Depth to Water Logger data from 2009-2019 at Great Sand Dunes National Park. This includes Raw dataset (primarily hourly), daily summaries, weekly summaries, and monthly summaries. Included in the data package are exploratory data analysis figures at the daily, weekly and monthly time steps. Lastly included is the R code used to extract the depth to water logger data from the National Park Service Aquarius data system, and to create the exploratory data analysis figures.
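    The summary products described here (daily, weekly, and monthly aggregates of a primarily hourly record) can be reproduced with a resampling step; a minimal pandas sketch on synthetic readings, noting that the actual package extracts data from the NPS Aquarius system with R:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Synthetic hourly depth-to-water readings (the raw, primarily hourly record).
idx = pd.date_range("2019-05-01", periods=24 * 60, freq="h")
hourly = pd.Series(0.5 + 0.1 * rng.standard_normal(len(idx)), index=idx)

# Aggregate the raw record into the packaged summary products.
daily = hourly.resample("D").agg(["mean", "min", "max"])
weekly = hourly.resample("W").mean()
monthly = hourly.resample("MS").mean()
print(daily.shape, len(weekly), len(monthly))
```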

  16. Wetlands Ecological Integrity Depth To Water Data - Florissant Fossil Beds...

    • catalog.data.gov
    Updated Nov 25, 2025
    Cite
    National Park Service (2025). Wetlands Ecological Integrity Depth To Water Data - Florissant Fossil Beds National Monument 2009-2019 [Dataset]. https://catalog.data.gov/dataset/wetlands-ecological-integrity-depth-to-water-data-florissant-fossil-beds-national-mon-2009
    Explore at:
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    National Park Service (http://www.nps.gov/)
    Area covered
    Florissant
    Description

    Wetlands Ecological Integrity Depth to Water Logger data from 2009-2019 at Florissant Fossil Beds National Monument. This includes Raw dataset (primarily hourly), daily summaries, weekly summaries, and monthly summaries. Included in the data package are exploratory data analysis figures at the daily, weekly and monthly time steps. Lastly included is the R code used to extract the depth to water logger data from the National Park Service Aquarius data system, and to create the exploratory data analysis figures.

  17. Data from: The Often-Overlooked Power of Summary Statistics in Exploratory...

    • acs.figshare.com
    xlsx
    Updated Jun 8, 2023
    Cite
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford (2023). The Often-Overlooked Power of Summary Statistics in Exploratory Data Analysis: Comparison of Pattern Recognition Entropy (PRE) to Other Summary Statistics and Introduction of Divided Spectrum-PRE (DS-PRE) [Dataset]. http://doi.org/10.1021/acs.jcim.1c00244.s002
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    ACS Publications
    Authors
    Tahereh G. Avval; Behnam Moeini; Victoria Carver; Neal Fairley; Emily F. Smith; Jonas Baltrusaitis; Vincent Fernandez; Bonnie. J. Tyler; Neal Gallagher; Matthew R. Linford
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
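The direct, non-factor-based character of these summary statistics can be illustrated with a short sketch. Each statistic is computed straight from one spectrum, with no factor model to fit first. The entropy statistic below is a plain Shannon entropy in the spirit of PRE, not the authors' exact formulation, and the spectra are toy vectors.

```python
import numpy as np

def summary_stats(spectrum):
    """Direct (non-factor-based) summary statistics for a single spectrum."""
    s = np.asarray(spectrum, dtype=float)
    p = s / s.sum()  # normalize to a probability-like vector for the entropy term
    return {
        "mean": s.mean(),
        "std": s.std(),
        "one_norm": np.abs(s).sum(),
        "range": s.max() - s.min(),
        "ssq": (s ** 2).sum(),
        # Shannon entropy in the spirit of PRE (simplified; not the paper's
        # exact PRE definition).
        "entropy": -np.sum(p[p > 0] * np.log2(p[p > 0])),
    }

flat = summary_stats([1.0, 1.0, 1.0, 1.0])    # uniform spectrum: maximal entropy
peaked = summary_stats([4.0, 0.0, 0.0, 0.0])  # single peak: zero entropy
```

A vector of such statistics per spectrum can then feed a clustering or kNN step directly, which is the role DS-PRE plays in the paper.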

  18. e

    Exploratory Data Analytics and Descriptive Statistics

    • paper.erudition.co.in
    html
    Updated Dec 3, 2025
    Cite
    Einetic (2025). Exploratory Data Analytics and Descriptive Statistics [Dataset]. https://paper.erudition.co.in/makaut/bachelor-in-business-administration-2020-2021/5/data-analytics-skills-for-managers
    Explore at:
    htmlAvailable download formats
    Dataset updated
    Dec 3, 2025
    Dataset authored and provided by
    Einetic
    License

    https://paper.erudition.co.in/terms

    Description

    Question Paper Solutions of the chapter Exploratory Data Analytics and Descriptive Statistics of Data Analytics Skills for Managers, 5th Semester, Bachelor in Business Administration 2020-2021

  19. H

    Replication Data for "Best Practices for Your Exploratory Factor Analysis: a...

    • dataverse.harvard.edu
    Updated Nov 9, 2021
    + more versions
    Cite
    Pablo Rogers (2021). Replication Data for "Best Practices for Your Exploratory Factor Analysis: a Factor Tutorial" published by RAC-Revista de Administração Contemporânea [Dataset]. http://doi.org/10.7910/DVN/RCX8FF
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 9, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Pablo Rogers
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains material related to the analysis performed in the article "Best Practices for Your Exploratory Factor Analysis: a Factor Tutorial". The material includes the data used in the analyses in .dat format, the labels (.txt) of the variables used in the Factor software, the outputs (.txt) evaluated in the article, and videos (.mp4 with English subtitles) recorded for the purpose of explaining the article. The videos can also be accessed in the following playlist: https://youtube.com/playlist?list=PLln41V0OsLHbSlYcDszn2PoTSiAwV5Oda. Below is a summary of the article: "Exploratory Factor Analysis (EFA) is one of the statistical methods most widely used in Administration, however, its current practice coexists with rules of thumb and heuristics given half a century ago. The purpose of this article is to present the best practices and recent recommendations for a typical EFA in Administration through a practical solution accessible to researchers. In this sense, in addition to discussing current practices versus recommended practices, a tutorial with real data on Factor is illustrated, a software that is still little known in the Administration area, but freeware, easy to use (point and click) and powerful. The step-by-step illustrated in the article, in addition to the discussions raised and an additional example, is also available in the format of tutorial videos. Through the proposed didactic methodology (article-tutorial + video-tutorial), we encourage researchers/methodologists who have mastered a particular technique to do the same. Specifically, about EFA, we hope that the presentation of the Factor software, as a first solution, can transcend the current outdated rules of thumb and heuristics, by making best practices accessible to Administration researchers". 
STEPS TO REPRODUCE
This repository is composed of four types of files: 1) three video files in .mp4 format (with English subtitles), which discuss the article and the extra example mentioned in it; 2) two databases in .dat format: i) 1047 observations with 24 variables of the WHOQOL instrument discussed in the article, and ii) 918 observations with 10 variables of the FWB scale (extra example); 3) two label files (.txt format) to be incorporated into the Factor software; and 4) five output files in .txt format.
The steps are:
1st: Read the article “Best Practices for Your Exploratory Factor Analysis: a Factor Tutorial” (DOI: 10.1590/1982-7849rac2022210085.en); OR watch the videos i) 1_Video_BestPractices.mp4 (https://youtu.be/ITh1w4tFerA) and ii) 2_Video_MultidimensionalExample.mp4 (https://youtu.be/9X77ARoyys0);
2nd: Insert the database WHOQOL_Data.dat into the Factor software and, optionally, the label file WHOQOL_Labels.txt, as explained in section 4.2 of the article or in the section that begins at the timestamp 6:35 of the video 2_Video_MultidimensionalExample.mp4 (https://youtu.be/9X77ARoyys0?t=395);
3rd: Configure the analyses as explained in section 4.3 of the article or in the section that begins at the timestamp 10:45 of the same video (https://youtu.be/9X77ARoyys0?t=645);
4th: Interpret the first output file (1_Output_WHOQOL_4Factors.txt) as explained in section 4.4 of the article or in the section that begins at the timestamp 20:45 (https://youtu.be/9X77ARoyys0?t=1245);
5th: Interpret the second output file (2_Output_WHOQOL_2Factors.txt) as explained in the section that starts at the timestamp 49:53 (https://youtu.be/9X77ARoyys0?t=2993);
6th: Interpret the third output file (3_Output_WHOQOL_2Factors_Ajusted.txt) as explained in the section that starts at the timestamp 1:05:45 (https://youtu.be/9X77ARoyys0?t=3945);
7th: Interpret the fourth output file (4_Output_WHOQOL_2Factors_Bifactor.txt) as explained in the section that starts at the timestamp 1:13:14 (https://youtu.be/9X77ARoyys0?t=4394).
Optionally, to replicate the extra example mentioned in the article:
8th: Insert the database FWB_Data.dat into the Factor software and, optionally, the label file FWB_Labels.txt, as explained in the section that starts at the timestamp 4:50 of the video 3_Video_UnidimensionalExample.mp4 (https://youtu.be/wFTGJG8XRRs?t=290);
9th: Configure the analyses as explained in the section that starts at the timestamp 8:32 of the same video (https://youtu.be/wFTGJG8XRRs?t=512);
10th: Interpret the output file FWB_Output.txt as explained in the section that begins at the timestamp 22:58 (https://youtu.be/wFTGJG8XRRs?t=1378).
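The fit-then-interpret workflow the tutorial walks through (load data, fit an exploratory factor model, examine rotated loadings) can be sketched in Python. Here scikit-learn's FactorAnalysis is a generic stand-in for the Factor software, and the data are synthetic rather than the WHOQOL or FWB files.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic stand-in for instrument data (not the repository's WHOQOL file):
# two latent factors driving ten observed variables, plus noise.
rng = np.random.default_rng(42)
latent = rng.standard_normal((500, 2))
loadings = rng.standard_normal((2, 10))
X = latent @ loadings + 0.3 * rng.standard_normal((500, 10))

# Fit a two-factor model with varimax rotation and recover factor scores.
fa = FactorAnalysis(n_components=2, rotation="varimax")
scores = fa.fit_transform(X)

# fa.components_ holds the rotated loadings (factors x variables); inspecting
# which variables load on which factor is the interpretation step above.
```

The Factor software adds many diagnostics (parallel analysis, fit indices) that this minimal sketch omits.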

  20. g

    Wetlands Ecological Integrity Depth To Water Data - Rocky Mountain National...

    • gimi9.com
    Updated Mar 2, 2021
    Cite
    (2021). Wetlands Ecological Integrity Depth To Water Data - Rocky Mountain National Park 2007-2019 | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_wetlands-ecological-integrity-depth-to-water-data-rocky-mountain-national-park-2007-2019
    Explore at:
    Dataset updated
    Mar 2, 2021
    Area covered
    Rocky Mountains
    Description

    Wetlands Ecological Integrity Depth to Water Logger data from 2007-2019 at Rocky Mountain National Park. This includes the raw dataset (primarily hourly), daily summaries, weekly summaries, and monthly summaries. Included in the data package are exploratory data analysis figures at the daily, weekly, and monthly time steps. Also included is the R code used to extract the depth to water logger data from the National Park Service Aquarius data system and to create the exploratory data analysis figures.

Cite
Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001

DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx

Explore at:
docxAvailable download formats
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
World
Description

Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. 
Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
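The reduce-then-cluster pipeline described above can be sketched with synthetic stand-in "spectra" rather than the actual IRMS data. PCA and k-means come from scikit-learn; UMAP, from the umap-learn package, could be swapped in for the reduction step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for high-dimensional mass spectra from two experimental
# conditions (e.g., two seawater compositions): 40 "spectra" x 200 channels.
rng = np.random.default_rng(1)
cond_a = rng.normal(0.0, 1.0, (20, 200))
cond_b = rng.normal(3.0, 1.0, (20, 200))
X = np.vstack([cond_a, cond_b])

# Reduce to a lower-dimensional space, then look for data-driven clusters.
embedding = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)

# Mapping clusters back to experimental conditions is then a matter of
# checking whether each condition's spectra share a cluster label.
```

In the study itself, this mapping step is what connects the unsupervised clusters to seawater composition and CO2 concentration.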
