100+ datasets found
  1. f

    DataSheet1_Exploratory data analysis (EDA) machine learning approaches for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001
    Explore at:
    docxAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.

  2. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Explore at:
    zip(2028853 bytes)Available download formats
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset: - CID (Customer ID): A unique identifier for each customer. - TID (Transaction ID): A unique identifier for each transaction. - Gender: The gender of the customer, categorized as Male or Female. - Age Group: Age group of the customer, divided into several ranges. - Purchase Date: The timestamp of when the transaction took place. - Product Category: The category of the product purchased, such as Electronics, Apparel, etc. - Discount Availed: Indicates whether the customer availed any discount (Yes/No). - Discount Name: Name of the discount applied (e.g., FESTIVE50). - Discount Amount (INR): The amount of discount availed by the customer. - Gross Amount: The total amount before applying any discount. - Net Amount: The final amount after applying the discount. - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.). - Location: The city where the purchase took place.

    Use Cases: 1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data. 2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis. 3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts. 4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning

  3. E

    Exploratory Data Analysis (EDA) Tools Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54164
    Explore at:
    ppt, pdf, docAvailable download formats
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across various industries. The market, estimated at $1.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033. This expansion is fueled by several key factors. Firstly, the rising adoption of big data analytics and business intelligence initiatives across large enterprises and SMEs is creating a significant demand for efficient EDA tools. Secondly, the growing need for faster, more insightful data analysis to support better decision-making is driving the preference for user-friendly graphical EDA tools over traditional non-graphical methods. Furthermore, advancements in artificial intelligence and machine learning are seamlessly integrating into EDA tools, enhancing their capabilities and broadening their appeal. The market segmentation reveals a significant portion held by large enterprises, reflecting their greater resources and data handling needs. However, the SME segment is rapidly gaining traction, driven by the increasing affordability and accessibility of cloud-based EDA solutions. Geographically, North America currently dominates the market, but regions like Asia-Pacific are exhibiting high growth potential due to increasing digitalization and technological advancements. Despite this positive outlook, certain restraints remain. The high initial investment cost associated with implementing advanced EDA solutions can be a barrier for some SMEs. Additionally, the need for skilled professionals to effectively utilize these tools can create a challenge for organizations. However, the ongoing development of user-friendly interfaces and the availability of training resources are actively mitigating these limitations. The competitive landscape is characterized by a mix of established players like IBM and emerging innovative companies offering specialized solutions. Continuous innovation in areas like automated data preparation and advanced visualization techniques will further shape the future of the EDA tools market, ensuring its sustained growth trajectory.

  4. E

    Exploratory Data Analysis (EDA) Tools Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54369
    Explore at:
    doc, ppt, pdfAvailable download formats
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming Exploratory Data Analysis (EDA) tools market! Our in-depth analysis reveals key trends, growth drivers, and top players shaping this $3 billion industry, projected for 15% CAGR through 2033. Learn about market segmentation, regional insights, and future opportunities.

  5. E

    Exploratory Data Analysis (EDA) Tools Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Feb 12, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archive Market Research (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.archivemarketresearch.com/reports/exploratory-data-analysis-eda-tools-21680
    Explore at:
    doc, pdf, pptAvailable download formats
    Dataset updated
    Feb 12, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Exploratory Data Analysis (EDA) Tools market is anticipated to experience significant growth in the coming years, driven by the increasing adoption of data-driven decision-making and the growing need for efficient data exploration and analysis. The market size is valued at USD XX million in 2025 and is projected to reach USD XX million by 2033, registering a CAGR of XX% during the forecast period. The increasing complexity and volume of data generated by businesses and organizations have necessitated the use of advanced data analysis tools to derive meaningful insights and make informed decisions. Key trends driving the market include the rising adoption of AI and machine learning technologies, the growing need for self-service data analytics, and the increasing emphasis on data visualization and storytelling. Non-graphical EDA tools are gaining traction due to their ability to handle large and complex datasets. Graphical EDA tools are preferred for their intuitive and interactive user interfaces that simplify data exploration. Large enterprises are major consumers of EDA tools as they have large volumes of data to analyze. SMEs are also increasingly adopting EDA tools as they realize the importance of data-driven insights for business growth. The North American region holds a significant market share due to the presence of established technology companies and a high adoption rate of data analytics solutions. The Asia Pacific region is expected to witness substantial growth due to the rising number of businesses and organizations in emerging economies.

  6. Exploratory data analysis of a clinical study group: Development of a...

    • plos.figshare.com
    txt
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bogumil M. Konopka; Felicja Lwow; Magdalena Owczarz; Łukasz Łaczmański (2023). Exploratory data analysis of a clinical study group: Development of a procedure for exploring multidimensional data [Dataset]. http://doi.org/10.1371/journal.pone.0201950
    Explore at:
    txtAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Bogumil M. Konopka; Felicja Lwow; Magdalena Owczarz; Łukasz Łaczmański
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Thorough knowledge of the structure of analyzed data allows to form detailed scientific hypotheses and research questions. The structure of data can be revealed with methods for exploratory data analysis. Due to multitude of available methods, selecting those which will work together well and facilitate data interpretation is not an easy task. In this work we present a well fitted set of tools for a complete exploratory analysis of a clinical dataset and perform a case study analysis on a set of 515 patients. The proposed procedure comprises several steps: 1) robust data normalization, 2) outlier detection with Mahalanobis (MD) and robust Mahalanobis distances (rMD), 3) hierarchical clustering with Ward’s algorithm, 4) Principal Component Analysis with biplot vectors. The analyzed set comprised elderly patients that participated in the PolSenior project. Each patient was characterized by over 40 biochemical and socio-geographical attributes. Introductory analysis showed that the case-study dataset comprises two clusters separated along the axis of sex hormone attributes. Further analysis was carried out separately for male and female patients. The most optimal partitioning in the male set resulted in five subgroups. Two of them were related to diseased patients: 1) diabetes and 2) hypogonadism patients. Analysis of the female set suggested that it was more homogeneous than the male dataset. No evidence of pathological patient subgroups was found. In the study we showed that outlier detection with MD and rMD allows not only to identify outliers, but can also assess the heterogeneity of a dataset. The case study proved that our procedure is well suited for identification and visualization of biologically meaningful patient subgroups.

  7. Titanic- exploratory data analysis

    • kaggle.com
    zip
    Updated Jul 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Karthik (2025). Titanic- exploratory data analysis [Dataset]. https://www.kaggle.com/datasets/pandureddy123/titanic-exploratory-data-analysis
    Explore at:
    zip(962897 bytes)Available download formats
    Dataset updated
    Jul 19, 2025
    Authors
    Karthik
    Description

    One more step towards Machine learning! This is a titatic dataset with exploratory data analysis html file. I used pandas-profiling for fast analysis.

  8. Z

    Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Washington and Lee University
    College of William and Mary
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

    Transformation. In this stage, we omitted to use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships on the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR od DL4SE”. 5.Interpretation/Evaluation. We used the Knowledge Discover to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = Number of occurrences this statement is true divided by the amount of statements Confidence = The support of the statement divided by the number of occurrences of the premise

  9. E

    Exploratory Data Analysis (EDA) Tools Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Nov 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Insights Market (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.datainsightsmarket.com/reports/exploratory-data-analysis-eda-tools-532159
    Explore at:
    pdf, ppt, docAvailable download formats
    Dataset updated
    Nov 7, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Explore the booming Exploratory Data Analysis (EDA) Tools market, projected to reach $10.5 billion by 2025 with a 12.5% CAGR. Discover key drivers, trends, and market share for large enterprises, SMEs, graphical & non-graphical tools across North America, Europe, APAC, and more.

  10. Cars93

    • kaggle.com
    zip
    Updated Sep 16, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yashpal (2022). Cars93 [Dataset]. https://www.kaggle.com/datasets/yashpaloswal/cars93/discussion
    Explore at:
    zip(4879 bytes)Available download formats
    Dataset updated
    Sep 16, 2022
    Authors
    Yashpal
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Content:- The file contains basic cars details

    Goal:- You can do multiple things using this dataset such as 1. Missing data treatment 2. Various Pandas operations (to learn; the basic concepts) 3. EDA 4. You can choose to run any machine learning algorithm, considering any features and any label.

    The basic purpose of this dataset is to get started in the field of data science and machine learning.

  11. Data Analysis in R

    • kaggle.com
    zip
    Updated May 16, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rajdeep Kaur Bajwa (2022). Data Analysis in R [Dataset]. https://www.kaggle.com/datasets/rajdeepkaurbajwa/data-analysis-r
    Explore at:
    zip(5321 bytes)Available download formats
    Dataset updated
    May 16, 2022
    Authors
    Rajdeep Kaur Bajwa
    Description

    Dataset

    This dataset was created by Rajdeep Kaur Bajwa

    Contents

  12. Visualization

    • kaggle.com
    zip
    Updated Sep 2, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sujan Shirol (2020). Visualization [Dataset]. https://www.kaggle.com/sujan97/visualization
    Explore at:
    zip(222468 bytes)Available download formats
    Dataset updated
    Sep 2, 2020
    Authors
    Sujan Shirol
    Description

    Context

    This is a curated pack of the dataset for learning data visualization. Refer notebook on how to learn visualization using these datasets.

    Content

    1. altcoins.csv
    2. barchart-divorcerates.csv
    3. barcharts-us-music-sales-by-genre.csv
    4. df_weed.csv
    5. index.csv
    6. iris.csv
    7. mtcars_csv-Copy1.csv
    8. sales.csv
    9. setosa-versicolor.csv
    10. tips.csvcalendar_view_week
  13. B

    Exploratory Data Analysis of Airbnb Data

    • borealisdata.ca
    • dataone.org
    Updated Dec 19, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Imad Ahmad; Ibtassam Rasheed; Yip Chi Man (2022). Exploratory Data Analysis of Airbnb Data [Dataset]. http://doi.org/10.5683/SP3/F2OCZF
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 19, 2022
    Dataset provided by
    Borealis
    Authors
    Imad Ahmad; Ibtassam Rasheed; Yip Chi Man
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Airbnb® is an American company operating an online marketplace for lodging, primarily for vacation rentals. The purpose of this study is to perform an exploratory data analysis of the two datasets containing Airbnb® listings and across 10 major cities. We aim to use various data visualizations to gain valuable insight on the effects of pricing, covid, and more!

  14. Exploratory Data Analysis: Crop Classification & Forecasting - EDA

    • ai.tracebloc.io
    json
    Updated Nov 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    tracebloc (2025). Exploratory Data Analysis: Crop Classification & Forecasting - EDA [Dataset]. https://ai.tracebloc.io/explore/ai-satellite-crop-classification-and-yield-forecasting?tab=exploratory-data-analysis
    Explore at:
    jsonAvailable download formats
    Dataset updated
    Nov 17, 2025
    Dataset provided by
    Tracebloc GmbH
    Authors
    tracebloc
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Missing Values
    Measurement technique
    Statistical and exploratory data analysis
    Description

    Understand the satellite dataset before training. Explore distributions, preprocessing steps, and key insights to evaluate 3rd-party models effectively.

  15. Slack Queries

    • kaggle.com
    zip
    Updated Aug 16, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aikanshi Vaish (2021). Slack Queries [Dataset]. https://www.kaggle.com/aikanshivaish/slack-queries
    Explore at:
    zip(6766 bytes)Available download formats
    Dataset updated
    Aug 16, 2021
    Authors
    Aikanshi Vaish
    Description
    • Hello Kagglers,
    • If you are new to kaggle and want to learn handling of datetime type dataset, this can be helpful for you to access and get best possible date time values. It includes some missing values in date time columns.
    • This is a dataset for beginners who want to learn EDA on datetime type data.
    • Three out of eight columns are datetime type convertible and object type in raw data.
    • You can use any method for filling missing values of datetime which you find will be best match
  16. f

    Data_Sheet_3_Mind the Queue: A Case Study in Visualizing Heterogeneous...

    • frontiersin.figshare.com
    xml
    Updated Jun 2, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Catherine McVey; Fushing Hsieh; Diego Manriquez; Pablo Pinedo; Kristina Horback (2023). Data_Sheet_3_Mind the Queue: A Case Study in Visualizing Heterogeneous Behavioral Patterns in Livestock Sensor Data Using Unsupervised Machine Learning Techniques.ZIP [Dataset]. http://doi.org/10.3389/fvets.2020.00523.s003
    Explore at:
    xmlAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Catherine McVey; Fushing Hsieh; Diego Manriquez; Pablo Pinedo; Kristina Horback
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sensor technologies allow ethologists to continuously monitor the behaviors of large numbers of animals over extended periods of time. This creates new opportunities to study livestock behavior in commercial settings, but also new methodological challenges. Densely sampled behavioral data from large heterogeneous groups can contain a range of complex patterns and stochastic structures that may be difficult to visualize using conventional exploratory data analysis techniques. The goal of this research was to assess the efficacy of unsupervised machine learning tools in recovering complex behavioral patterns from such datasets to better inform subsequent statistical modeling. This methodological case study was carried out using records on milking order, or the sequence in which cows arrange themselves as they enter the milking parlor. Data was collected over a 6-month period from a closed group of 200 mixed-parity Holstein cattle on an organic dairy. Cows at the front and rear of the queue proved more consistent in their entry position than animals at the center of the queue, a systematic pattern of heterogeneity more clearly visualized using entropy estimates, a scale and distribution-free alternative to variance robust to outliers. Dimension reduction techniques were then used to visualize relationships between cows. No evidence of social cohesion was recovered, but Diffusion Map embeddings proved more adept than PCA at revealing the underlying linear geometry of this data. Median parlor entry positions from the pre- and post-pasture subperiods were highly correlated (R = 0.91), suggesting a surprising degree of temporal stationarity. Data Mechanics visualizations, however, revealed heterogeneous non-stationary among subgroups of animals in the center of the group and herd-level temporal outliers. A repeated measures model recovered inconsistent evidence of a relationships between entry position and cow attributes. Mutual conditional entropy tests, a permutation-based approach to assessing bivariate correlations robust to non-independence, confirmed a significant but non-linear association with peak milk yield, but revealed the age effect to be potentially confounded by health status. Finally, queueing records were related back to behaviors recorded via ear tag accelerometers using linear models and mutual conditional entropy tests. Both approaches recovered consistent evidence of differences in home pen behaviors across subsections of the queue.

  17. h

    diabetes_eda_analysis

    • huggingface.co
    Updated Nov 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    GUY SHILO (2025). diabetes_eda_analysis [Dataset]. https://huggingface.co/datasets/guyshilo12/diabetes_eda_analysis
    Explore at:
    Dataset updated
    Nov 19, 2025
    Authors
    GUY SHILO
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Diabetes Dataset — Exploratory Data Analysis (EDA)

    This repository contains a diabetes-related tabular dataset and a complete Exploratory Data Analysis (EDA).The main objective of this project was to learn how to conduct a structured EDA, apply best practices, and extract meaningful insights from real-world health data.
    The analysis includes correlations, distributions, group comparisons, class balance exploration, and statistical interpretations that illustrate how different… See the full description on the dataset page: https://huggingface.co/datasets/guyshilo12/diabetes_eda_analysis.

  18. SEM regression for H1-5.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). SEM regression for H1-5. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t004
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  19. Credit EDA Case Study Data

    • kaggle.com
    zip
    Updated Jan 11, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ADITYA MISHRA (2022). Credit EDA Case Study Data [Dataset]. https://www.kaggle.com/adityamishra0708/credit-eda-case-study-data
    Explore at:
    zip(117814223 bytes)Available download formats
    Dataset updated
    Jan 11, 2022
    Authors
    ADITYA MISHRA
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by ADITYA MISHRA

    Released under CC0: Public Domain

    Contents

  20. m

    Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

    The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

    This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001

DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx

Related Article
Explore at:
docxAvailable download formats
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
World
Description

Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.

Search
Clear search
Close search
Google apps
Main menu