37 datasets found
  1. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Washington and Lee University
    College of William and Mary
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions, and the mined facts support the arguments that shape the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. These hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, found in the repository. In fact, we manually engineered these features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. Preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to fill in information left missing by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of each paper by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models.
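
    As a rough illustration of this Transformation step (not the RapidMiner pipeline used in the study), the sketch below one-hot encodes a hypothetical export of the 35 nominal features, projects it to 2 components with PCA, and scans k-means inertia to suggest a cluster count. The file name and encoding choices are assumptions.

    ```python
    # Illustrative sketch only; the study's analysis was performed in RapidMiner.
    # "dl4se_features.csv" is a hypothetical export of the 35 nominal features.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    features = pd.read_csv("dl4se_features.csv")
    X = pd.get_dummies(features.astype(str)).to_numpy()   # nominal -> one-hot

    X_2d = PCA(n_components=2).fit_transform(X)           # 2 components for plotting

    # Elbow-style scan: inertia measures the remaining within-cluster variance.
    for k in range(2, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 2))
    ```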

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (Correlations and Association Rules) and categorizing the DL4SE papers for a better segmentation of the state of the art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, and this reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles represent both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that, given the premise, the conclusion is associated with it. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.

    Support = the number of occurrences in which the statement (premise and conclusion together) holds, divided by the total number of statements.
    Confidence = the support of the statement divided by the support of the premise.
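
    As a toy illustration of these two definitions (the five rows below are invented, not taken from the SLR data):

    ```python
    # Toy example of Support and Confidence for the rule
    # "Supervised Learning -> irreproducible"; the data are made up.
    import pandas as pd

    papers = pd.DataFrame({
        "supervised_learning": [True, True, True, False, True],
        "irreproducible":      [True, True, False, False, True],
    })

    premise = papers["supervised_learning"]
    rule = premise & papers["irreproducible"]

    support = rule.sum() / len(papers)        # rule holds in 3 of 5 papers -> 0.60
    confidence = rule.sum() / premise.sum()   # rule holds in 3 of the 4 premise papers -> 0.75
    print(f"support={support:.2f}, confidence={confidence:.2f}")
    ```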

  2. Google Data Analytics Capstone

    • kaggle.com
    zip
    Updated Aug 9, 2022
    Cite
    Reilly McCarthy (2022). Google Data Analytics Capstone [Dataset]. https://www.kaggle.com/datasets/reillymccarthy/google-data-analytics-capstone/discussion
    Explore at:
    zip (67456 bytes)
    Dataset updated
    Aug 9, 2022
    Authors
    Reilly McCarthy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Hello! Welcome to the Capstone project I completed to earn my Data Analytics certificate through Google. I chose to complete this case study in RStudio Desktop because R is the primary new concept I learned throughout this course, and I wanted to embrace my curiosity and learn more about R through this project. At the beginning of this report I will provide the scenario of the case study I was given. After this I will walk you through my Data Analysis process based on the steps I learned in this course:

    1. Ask
    2. Prepare
    3. Process
    4. Analyze
    5. Share
    6. Act

    The data I used for this analysis comes from this FitBit data set: https://www.kaggle.com/datasets/arashnic/fitbit

    " This dataset generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. "

  3. Groups of words for our Z and X variables.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). Groups of words for our Z and X variables. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t002
    Explore at:
    xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  4. Process Mining Event Log - Incident Management

    • kaggle.com
    zip
    Updated Apr 20, 2025
    Cite
    Alberto P (2025). Process Mining Event Log - Incident Management [Dataset]. https://www.kaggle.com/datasets/albertopmd/process-mining-event-log-incident-management
    Explore at:
    zip (2301112 bytes)
    Dataset updated
    Apr 20, 2025
    Authors
    Alberto P
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This realistic incident management event log simulates a common IT service process and includes key inefficiencies found in real-world operations. You'll uncover SLA violations, multiple reassignments, bottlenecks, and conformance issues—making it an ideal dataset for hands-on process mining, root cause analysis, and performance optimization exercises.

    You can find more event logs + use case handbooks to guide your analysis here: https://processminingdata.com/

    Standard Process Flow: Ticket Created -> Ticket Assigned to Level 1 Support -> WIP - Level 1 Support -> Level 1 Escalates to Level 2 Support -> WIP - Level 2 Support -> Ticket Solved by Level 2 Support -> Customer Feedback Received -> Ticket Closed

    Total Number of Incident Tickets: 31,000+

    Process Variants: 13

    Number of Events: 242,000+

    Year: 2023

    File Format: CSV

    File Size: 65MB
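
    A minimal pandas sketch of a first pass over such an event log, counting process variants per ticket. The column names (case_id, activity, timestamp) are assumptions and may differ from the actual CSV schema.

    ```python
    # Hedged sketch: derive process variants from an event log with pandas.
    import pandas as pd

    log = pd.read_csv("incident_event_log.csv", parse_dates=["timestamp"])
    log = log.sort_values(["case_id", "timestamp"])

    # A variant is the ordered sequence of activities within one ticket (case).
    variants = (log.groupby("case_id")["activity"]
                   .apply(" -> ".join)
                   .value_counts())

    print(variants.head(13))   # the dataset reports 13 variants in total
    ```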

  5. The five-step co-duction cycle.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). The five-step co-duction cycle. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t001
    Explore at:
    xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  6. Data from: Research and exploratory analysis driven - time-data visualization (read-tv) software

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jan 30, 2022
    Cite
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko (2022). Research and exploratory analysis driven - time-data visualization (read-tv) software [Dataset]. http://doi.org/10.5061/dryad.d51c5b02g
    Explore at:
    zip
    Dataset updated
    Jan 30, 2022
    Dataset provided by
    Dryad
    Authors
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko
    Time period covered
    Jan 25, 2021
    Description

    This section does not describe the methods of read-tv software development, which can be found in the associated manuscript from JAMIA Open (JAMIO-2020-0121.R1). It describes the methods involved in the surgical workflow disruption data collection. A curated version of this dataset, free of PHI (protected health information), was used as a use case for this manuscript.

    Observer training

    Trained human factors researchers conducted each observation following the completion of observer training. The researchers were two full-time research assistants based in the department of surgery at site 3 who visited the other two sites to collect data. Human Factors experts guided and trained each observer in the identification and standardized collection of flow disruptions (FDs). The observers were also trained in the basic components of robotic surgery in order to be able to tangibly isolate and describe such disruptive events.

    Comprehensive observer training was ensured with both classroom and floor train...

  7. nevo-reuven_fifa23-player-analysis

    • huggingface.co
    Updated Nov 12, 2025
    Cite
    Nevo Reuven (2025). nevo-reuven_fifa23-player-analysis [Dataset]. https://huggingface.co/datasets/Nevoreuven/nevo-reuven_fifa23-player-analysis
    Explore at:
    Dataset updated
    Nov 12, 2025
    Authors
    Nevo Reuven
    Description

    ⚽ FIFA 23 Player Market Value Analysis

    📘 Overview

    This project analyzes the FIFA 23 Complete Player Dataset to explore which player attributes have the greatest influence on a football player's market value. The analysis includes:

    Data Loading
    Data Cleaning
    Handling Missing Values
    Outlier Detection
    Feature Preparation
    Exploratory Data Analysis (EDA)
    Visualizations
    Insights & Conclusions

    This document summarizes the full workflow and the analytical… See the full description on the dataset page: https://huggingface.co/datasets/Nevoreuven/nevo-reuven_fifa23-player-analysis.
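
    A small pandas sketch of the EDA step described above: ranking numeric attributes by their correlation with market value. The file and column names (value_eur in particular) are assumptions based on the public FIFA 23 dataset, not confirmed by this page.

    ```python
    # Hypothetical file and column names; adjust to the actual schema.
    import pandas as pd

    players = pd.read_csv("fifa23_players.csv")
    players = players.dropna(subset=["value_eur"])         # drop rows missing the target

    corr = (players.select_dtypes("number")                # keep numeric attributes only
                   .corr()["value_eur"]
                   .drop("value_eur")
                   .sort_values(ascending=False))
    print(corr.head(10))   # attributes most associated with market value
    ```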

  8. SEM regression for H1-5.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). SEM regression for H1-5. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t004
    Explore at:
    xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  9. Feature contributions and top-three feature interactions (MFIs).

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). Feature contributions and top-three feature interactions (MFIs). [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t003
    Explore at:
    xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature contributions and top-three feature interactions (MFIs).

  10. Smartphones Dataset (August 2024)

    • kaggle.com
    zip
    Updated Aug 24, 2024
    Cite
    Dilkush Singh (2024). Smartphones Dataset (August 2024) [Dataset]. https://www.kaggle.com/datasets/dilkushsingh/smartphones-dataset-upto-july24
    Explore at:
    zip (605033 bytes)
    Dataset updated
    Aug 24, 2024
    Authors
    Dilkush Singh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Smartphones Dataset (August 2024)

    This dataset contains information on the latest smartphones as of July 2024, gathered through web scraping using Selenium and Beautiful Soup. The dataset is available in four different versions, reflecting the stages of data cleaning and processing.
    - If you want to know about the web scraping process, read the Medium Article blog post.
    - If you want to see the step-by-step Data Cleaning and EDA process, check out my GitHub Repo.

    Dataset Versions:

    Version 1: Raw Data (smartphones.csv or smartphones_uncleaned.csv - same files)

    This version contains the fully uncleaned data as it was initially scraped from the web. It includes all the raw information, with inconsistencies, missing values, and potential duplicates. Purpose: Serves as the baseline dataset for understanding the initial state of the data before any cleaning or processing.

    Version 2: Basic Cleaning (smartphones_cleaned_v1.csv)

    Basic cleaning operations have been applied. This includes removing duplicates, handling missing values, and standardizing the formats of certain fields (e.g., dates, numerical values). Purpose: Provides a cleaner and more consistent dataset, making it easier for basic analysis.

    Version 3: Intermediate Cleaning (smartphones_cleaned_v2.csv)

    Additional data cleaning techniques have been implemented. This version addresses more complex issues such as outlier detection and correction, normalization of categorical data, and initial feature engineering. Purpose: Offers a more refined dataset suitable for exploratory data analysis (EDA) and more in-depth statistical analyses.

    Version 4: Fully Cleaned and Processed Data (smartphones_cleaned_v3.csv)

    This version represents the final, fully cleaned dataset. Advanced cleaning techniques have been applied, including imputation of missing data, removal of irrelevant features, and final feature engineering. Purpose: Ideal for machine learning model training and other advanced analytics.

  11. Higgs bosons and a background process

    • kaggle.com
    zip
    Updated Jan 16, 2021
    Cite
    PAVAN KUMAR D (2021). Higgs bosons and a background process [Dataset]. https://www.kaggle.com/mragpavank/higs-bonsons-and-background-process
    Explore at:
    zip (11985839 bytes)
    Dataset updated
    Jan 16, 2021
    Authors
    PAVAN KUMAR D
    Description

    Dataset

    This dataset was created by PAVAN KUMAR D


  12. GEO_Processing_Exploratory_DGE_Analysis

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GEO_Processing_Exploratory_DGE_Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/geo-processing-exploratory-dge-analysis
    Explore at:
    zip (13026816 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides a comprehensive workflow for differential gene expression (DGE) analysis.

    It focuses on processing and analyzing GEO (Gene Expression Omnibus) datasets.

    The dataset includes code for retrieving GEO datasets directly from NCBI GEO.

    It provides data cleaning, normalization, and pre-processing steps for gene expression data.

    The workflow demonstrates exploratory data analysis (EDA) on gene expression datasets.

    Differential expression analysis is performed to identify significantly expressed genes.

    Includes visualizations such as heatmaps, volcano plots, and PCA for insights.

    Designed for researchers and bioinformaticians interested in gene expression analysis.

    Supports reproducibility and can be adapted to different GEO datasets.

    Uses the Python programming language and popular data analysis libraries such as pandas, NumPy, and matplotlib.

    Encourages integration with downstream functional enrichment and pathway analysis.
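
    As a hedged illustration of one possible differential-expression step consistent with the workflow above (not the dataset's included code), the sketch below computes a log2 fold change and a per-gene t-test between two hypothetical sample groups; real GEO analyses typically use limma- or DESeq2-style models.

    ```python
    # Illustrative only: simple log2 fold change + t-test per gene.
    # File name and the "control"/"treated" column naming are assumptions.
    import numpy as np
    import pandas as pd
    from scipy import stats

    expr = pd.read_csv("geo_expression_matrix.csv", index_col=0)   # genes x samples
    control = expr.filter(like="control")
    treated = expr.filter(like="treated")

    log2fc = np.log2(treated.mean(axis=1) + 1) - np.log2(control.mean(axis=1) + 1)
    pvals = stats.ttest_ind(treated, control, axis=1).pvalue

    results = pd.DataFrame({"log2fc": log2fc, "pval": pvals}, index=expr.index)
    print(results.sort_values("pval").head())   # candidate differentially expressed genes
    ```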

  13. 99 Little Orange, Technical Business Case

    • kaggle.com
    zip
    Updated Jun 13, 2022
    Cite
    IVAN CHAVEZ (2022). 99 Little Orange, Technical Business Case [Dataset]. https://www.kaggle.com/datasets/ivanchvez/99littleorange
    Explore at:
    zip (91998345 bytes)
    Dataset updated
    Jun 13, 2022
    Authors
    IVAN CHAVEZ
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    99 Little Orange, Technical Business Case

    Dear candidate, we are so excited about your interest in working with us! This challenge is an opportunity for us to get to know a bit of the great talent we know you have. It was built to simulate real-case scenarios that you would face while working at [Organization] and is organized in 2 parts:

      1. A technical part of close-ended questions with specific answers that are meant to assess your ability to analyze large amounts of data with SQL to answer key questions.
      2. An analytical part of open-ended questions to assess your ability to build data-backed recommendations to support decision-making. Expect further questions and discussions on top of your answers in the next phase of our hiring process.

    Part I - Technical. Provide both the answer and the SQL code used.
    1. What is the average trip cost of holidays? How does it compare to non-holidays?
    2. Find the average call time of the first trip that passengers make.
    3. Find the average number of trips per driver for every weekday.
    4. Which day of the week do drivers usually drive the most distance on average?
    5. What was the growth percentage of rides month over month?
    6. Optional. List the top 5 drivers by number of trips in the top 5 largest cities.

    Part II - Analytical. 99 is a marketplace where drivers are the supply and passengers the demand. One of our main challenges is to keep this marketplace balanced. If there is too much demand, prices would increase due to surge and passengers would prefer not to ride. If there is too much supply, drivers would spend more time idle, impacting their revenue.
    1. Let's say it's 2019-09-23 and a new Operations manager for The Shire was just hired. She has 5 minutes during the Ops weekly meeting to present an overview of the business in the city, and since she's just arrived, she asked your help to do it. What would you prepare for this 5 minute presentation? Please provide 1-2 slides with your idea.
    2. She also mentioned she has a budget to invest in promoting the business. What kind of metrics and performance indicators would you use in order to help her decide whether she should invest it into the passenger side or the driver side? Extra point if you provide data-backed recommendations.
    3. One month later, she comes back, super grateful for all the helpful insights you have given her, and says she is anticipating a driver supply shortage due to a major concert that is going to take place the next day and also a 3-day city holiday that is coming the next month. What would you do to help her analyze the best course of action to either prevent or minimize the problem in each case?
    4. Optional. We want to build a model to predict “Possible Churn Users” (e.g., no trips in the past 4 weeks). List all features that you can think about and the data mining or machine learning model or other methods you may use for this case.
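
    The challenge itself asks for SQL, but as a rough pandas illustration of Part I, question 1 (average trip cost on holidays vs. non-holidays), with a hypothetical trips.csv containing trip_cost and is_holiday columns:

    ```python
    # Illustrative pandas equivalent of Part I, question 1; the schema is assumed.
    import pandas as pd

    trips = pd.read_csv("trips.csv")
    print(trips.groupby("is_holiday")["trip_cost"].mean())   # one average per group
    ```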

  14. Facebook User Engagement Data (29 chars)

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Saiful Islam Rafi (2025). Facebook User Engagement Data (29 chars) [Dataset]. https://www.kaggle.com/datasets/saifulislamrafixyz/facebook-user-engagement-data-29-chars
    Explore at:
    zip (485217 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Saiful Islam Rafi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comprehensive Facebook user dataset containing 20 features including user demographics (age, gender, country), account information (verification status, account type), and engagement metrics (likes, comments, shares, posts). The dataset includes realistic data quality issues such as missing values (NaN), duplicates, outliers, typos, inconsistent formats, impossible values, and mixed data types. Ideal for practicing data cleaning, exploratory data analysis (EDA), feature engineering, and data preprocessing workflows.
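
    A short, hypothetical pandas sketch of a first cleaning pass over the issues listed above; column names such as age, gender, and likes are assumptions about the schema.

    ```python
    # Hedged sketch of basic cleaning: duplicates, mixed types, impossible values,
    # inconsistent formats, and simple imputation.
    import pandas as pd

    df = pd.read_csv("facebook_users.csv")

    df = df.drop_duplicates()
    df["age"] = pd.to_numeric(df["age"], errors="coerce")       # mixed types -> NaN
    df = df[df["age"].between(13, 100) | df["age"].isna()]      # drop impossible ages
    df["gender"] = df["gender"].str.strip().str.lower()         # normalize formats/typos
    df["likes"] = df["likes"].fillna(df["likes"].median())      # simple imputation

    print(df.isna().sum())   # remaining missing values per column
    ```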

  15. Weights of variables in the indicator system.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Dec 16, 2024
    Cite
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen (2024). Weights of variables in the indicator system. [Dataset]. http://doi.org/10.1371/journal.pone.0314242.t001
    Explore at:
    xls
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The article examines the spatial distribution characteristics and influencing factors of traditional Tibetan “Bengke” residential architecture in Luhuo County, Ganzi Tibetan Autonomous Prefecture, Sichuan Province. The study utilizes spatial statistical methods, including Average Nearest Neighbor Analysis, Getis-Ord Gi*, and Kernel Density Estimation, to identify significant clustering patterns of Bengke architecture. Spatial autocorrelation was tested using Moran’s Index, with results indicating no significant spatial autocorrelation, suggesting that the distribution mechanisms are complex and influenced by multiple factors. Additionally, exploratory data analysis (EDA), the Analytic Hierarchy Process (AHP), and regression methods such as Lasso and Elastic Net were used to identify and validate key factors influencing the distribution of these buildings. The analysis reveals that road density, population density, economic development quality, and industrial structure are the most significant factors. The study also highlights that these factors vary in impact between high-density and low-density areas, depending on the regional environment. These findings offer a comprehensive understanding of the spatial patterns of Bengke architecture and provide valuable insights for the preservation and sustainable development of this cultural heritage.
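
    As a rough sketch of the Lasso and Elastic Net step described above (not the authors' actual code; the indicator file and the target column are assumptions):

    ```python
    # Illustrative only: rank candidate factors with Lasso / Elastic Net.
    # "bengke_indicators.csv" and the target column "bengke_density" are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LassoCV, ElasticNetCV
    from sklearn.preprocessing import StandardScaler

    data = pd.read_csv("bengke_indicators.csv")
    y = data["bengke_density"]
    X = StandardScaler().fit_transform(data.drop(columns="bengke_density"))

    lasso = LassoCV(cv=5).fit(X, y)
    enet = ElasticNetCV(cv=5).fit(X, y)

    coefs = pd.DataFrame({"lasso": lasso.coef_, "elastic_net": enet.coef_},
                         index=data.columns.drop("bengke_density"))
    print(coefs.reindex(coefs["lasso"].abs().sort_values(ascending=False).index))
    ```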

  16. Factors influencing high and low aggregation areas.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Dec 16, 2024
    Cite
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen (2024). Factors influencing high and low aggregation areas. [Dataset]. http://doi.org/10.1371/journal.pone.0314242.t002
    Explore at:
    xls
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Factors influencing high and low aggregation areas.

  17. Bellabeat Case Study Outline

    • kaggle.com
    zip
    Updated Jun 25, 2024
    Cite
    Sydney Yauney (2024). Bellabeat Case Study Outline [Dataset]. https://www.kaggle.com/datasets/sydneylynnyoung/bellabeat-case-study-clean-data/data
    Explore at:
    zip (5365 bytes)
    Dataset updated
    Jun 25, 2024
    Authors
    Sydney Yauney
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    In this project, I was able to provide valuable insight to the Bellabeat marketing team through the process of cleaning and analyzing public smartwatch data in order to find smartwatch usage trends.

    I followed all six of the data analysis steps in this project: Ask, Prepare, Process, Analyze, Share, and Act. This involved focusing on the business task, searching for credible data, cleaning and analyzing the data, creating simple and effective data visualizations, and coming up with a final recommendation and presentation for stakeholders. The tools I chose to use were BigQuery and Tableau; I chose these due to the large size of the data.

    The data attached is the result of the combination of 4 public datasets, found on Kaggle. All 4 datasets contain data from Fitbit users. The data was combined and cleaned, and then a second table was created, after parsing the weekday from the dates of the original data.

    Below are some data vizualizations created to represent trends in this data.

    Most Popular Time to Be Active (chart)

    Most Popular Day to Be Active (chart)

    After finding trends showing that smartwatch users are most active on the weekends and during the evenings, my solution was to create a marketing campaign geared towards women who work a 9-5 sedentary-style job. The goal of the project, which was to provide valuable insight based on public data, was accomplished through this recommendation. I cleaned and analyzed public Fitbit user data, identified trends, and provided an insightful recommendation for a new focus for the marketing team.

  18. Titanic Dataset

    • kaggle.com
    zip
    Updated Sep 29, 2025
    Cite
    Prince Rajak (2025). Titanic Dataset [Dataset]. https://www.kaggle.com/datasets/prince7489/titanic-dataset
    Explore at:
    zip (1849548 bytes)
    Dataset updated
    Sep 29, 2025
    Authors
    Prince Rajak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Titanic Survival Prediction Project explores one of the most iconic datasets in data science. The goal is to predict whether a passenger survived the Titanic disaster based on key attributes such as age, gender, ticket class, family size, and fare.

    Using a dataset of 100,000 synthetic records inspired by the original Titanic data, this project demonstrates a complete data science workflow — including data cleaning, exploratory data analysis (EDA), feature engineering, and predictive modeling.

    By analyzing patterns (e.g., higher survival rates among women, children, and first-class passengers), the project showcases how machine learning can uncover meaningful insights from historical events.
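
    A compact, hypothetical sketch of the modeling step on such a file; the column names follow the classic Titanic schema (Survived, Pclass, Sex, Age, Fare) and are an assumption about this synthetic version.

    ```python
    # Illustrative baseline model; not the project's actual notebook.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("titanic.csv")
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())

    X = df[["Pclass", "Sex", "Age", "Fare"]]
    y = df["Survived"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")
    ```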

  19. Demographics of patients.

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xls
    Updated Jun 1, 2023
    Cite
    Prathiba Natesan; Dima Hadid; Yara Abou Harb; Eveline Hitti (2023). Demographics of patients. [Dataset]. http://doi.org/10.1371/journal.pone.0221087.t001
    Explore at:
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Prathiba Natesan; Dima Hadid; Yara Abou Harb; Eveline Hitti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Demographics of patients.

  20. Factor/pattern coefficients from exploratory factor analysis (EFA) and...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Prathiba Natesan; Dima Hadid; Yara Abou Harb; Eveline Hitti (2023). Factor/pattern coefficients from exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). [Dataset]. http://doi.org/10.1371/journal.pone.0221087.t002
    Explore at:
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Prathiba Natesan; Dima Hadid; Yara Abou Harb; Eveline Hitti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Factor/pattern coefficients from exploratory factor analysis (EFA) and confirmatory factor analysis (CFA).
