37 datasets found
  1. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Washington and Lee University
    College of William and Mary
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions, and the mined facts support the arguments that shape the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. These hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, found in the repository. In fact, we manually engineered these features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. Preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to fill in information left missing by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of each paper by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models.
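
    As a rough illustration of this Transformation step (not the RapidMiner pipeline used in the study), the sketch below one-hot encodes a hypothetical export of the 35 nominal features, projects it to 2 components with PCA, and scans k-means inertia to suggest a cluster count. The file name and encoding choices are assumptions.

    ```python
    # Illustrative sketch only; the study's analysis was performed in RapidMiner.
    # "dl4se_features.csv" is a hypothetical export of the 35 nominal features.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    features = pd.read_csv("dl4se_features.csv")
    X = pd.get_dummies(features.astype(str)).to_numpy()   # nominal -> one-hot

    X_2d = PCA(n_components=2).fit_transform(X)           # 2 components for plotting

    # Elbow-style scan: inertia measures the remaining within-cluster variance.
    for k in range(2, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 2))
    ```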

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (Correlations and Association Rules) and categorizing the DL4SE papers for a better segmentation of the state of the art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, and this reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles represent both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that, given the premise, the conclusion is associated with it. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.

    Support = the number of occurrences in which the statement (premise and conclusion together) holds, divided by the total number of statements.
    Confidence = the support of the statement divided by the support of the premise.
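
    As a toy illustration of these two definitions (the five rows below are invented, not taken from the SLR data):

    ```python
    # Toy example of Support and Confidence for the rule
    # "Supervised Learning -> irreproducible"; the data are made up.
    import pandas as pd

    papers = pd.DataFrame({
        "supervised_learning": [True, True, True, False, True],
        "irreproducible":      [True, True, False, False, True],
    })

    premise = papers["supervised_learning"]
    rule = premise & papers["irreproducible"]

    support = rule.sum() / len(papers)        # rule holds in 3 of 5 papers -> 0.60
    confidence = rule.sum() / premise.sum()   # rule holds in 3 of the 4 premise papers -> 0.75
    print(f"support={support:.2f}, confidence={confidence:.2f}")
    ```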

  2. Google Data Analytics Capstone

    • kaggle.com
    zip
    Updated Aug 9, 2022
    Cite
    Reilly McCarthy (2022). Google Data Analytics Capstone [Dataset]. https://www.kaggle.com/datasets/reillymccarthy/google-data-analytics-capstone/discussion
    Explore at:
    zip (67456 bytes)
    Dataset updated
    Aug 9, 2022
    Authors
    Reilly McCarthy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Hello! Welcome to the Capstone project I completed to earn my Data Analytics certificate through Google. I chose to complete this case study in RStudio Desktop because R is the primary new concept I learned throughout this course, and I wanted to embrace my curiosity and learn more about R through this project. At the beginning of this report I will provide the scenario of the case study I was given. After this I will walk you through my Data Analysis process based on the steps I learned in this course:

    1. Ask
    2. Prepare
    3. Process
    4. Analyze
    5. Share
    6. Act

    The data I used for this analysis comes from this FitBit data set: https://www.kaggle.com/datasets/arashnic/fitbit

    " This dataset generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. "

  3. Groups of words for our Z and X variables.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). Groups of words for our Z and X variables. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t002
    Explore at:
    xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  4. Process Mining Event Log - Incident Management

    • kaggle.com
    zip
    Updated Apr 20, 2025
    Cite
    Alberto P (2025). Process Mining Event Log - Incident Management [Dataset]. https://www.kaggle.com/datasets/albertopmd/process-mining-event-log-incident-management
    Explore at:
    zip (2301112 bytes)
    Dataset updated
    Apr 20, 2025
    Authors
    Alberto P
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This realistic incident management event log simulates a common IT service process and includes key inefficiencies found in real-world operations. You'll uncover SLA violations, multiple reassignments, bottlenecks, and conformance issues—making it an ideal dataset for hands-on process mining, root cause analysis, and performance optimization exercises.

    You can find more event logs + use case handbooks to guide your analysis here: https://processminingdata.com/

    Standard Process Flow: Ticket Created -> Ticket Assigned to Level 1 Support -> WIP - Level 1 Support -> Level 1 Escalates to Level 2 Support -> WIP - Level 2 Support -> Ticket Solved by Level 2 Support -> Customer Feedback Received -> Ticket Closed

    Total Number of Incident Tickets: 31,000+

    Process Variants: 13

    Number of Events: 242,000+

    Year: 2023

    File Format: CSV

    File Size: 65MB
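
    A minimal pandas sketch of a first pass over such an event log, counting process variants per ticket. The column names (case_id, activity, timestamp) are assumptions and may differ from the actual CSV schema.

    ```python
    # Hedged sketch: derive process variants from an event log with pandas.
    import pandas as pd

    log = pd.read_csv("incident_event_log.csv", parse_dates=["timestamp"])
    log = log.sort_values(["case_id", "timestamp"])

    # A variant is the ordered sequence of activities within one ticket (case).
    variants = (log.groupby("case_id")["activity"]
                   .apply(" -> ".join)
                   .value_counts())

    print(variants.head(13))   # the dataset reports 13 variants in total
    ```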

  5. The five-step co-duction cycle.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). The five-step co-duction cycle. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t001
    Explore at:
    xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  6. Data from: Research and exploratory analysis driven - time-data visualization (read-tv) software

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jan 30, 2022
    Cite
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko (2022). Research and exploratory analysis driven - time-data visualization (read-tv) software [Dataset]. http://doi.org/10.5061/dryad.d51c5b02g
    Explore at:
    zip
    Dataset updated
    Jan 30, 2022
    Dataset provided by
    Dryad
    Authors
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko
    Time period covered
    Jan 25, 2021
    Description

    This section does not describe the methods of read-tv software development, which can be found in the associated manuscript from JAMIA Open (JAMIO-2020-0121.R1). It describes the methods involved in the surgical workflow disruption data collection. A curated version of this dataset, free of PHI (protected health information), was used as a use case for this manuscript.

    Observer training

    Trained human factors researchers conducted each observation following the completion of observer training. The researchers were two full-time research assistants based in the department of surgery at site 3 who visited the other two sites to collect data. Human Factors experts guided and trained each observer in the identification and standardized collection of flow disruptions (FDs). The observers were also trained in the basic components of robotic surgery in order to be able to tangibly isolate and describe such disruptive events.

    Comprehensive observer training was ensured with both classroom and floor train...

  7. nevo-reuven_fifa23-player-analysis

    • huggingface.co
    Updated Nov 12, 2025
    Cite
    Nevo Reuven (2025). nevo-reuven_fifa23-player-analysis [Dataset]. https://huggingface.co/datasets/Nevoreuven/nevo-reuven_fifa23-player-analysis
    Explore at:
    Dataset updated
    Nov 12, 2025
    Authors
    Nevo Reuven
    Description

    ⚽ FIFA 23 Player Market Value Analysis

    📘 Overview

    This project analyzes the FIFA 23 Complete Player Dataset to explore which player attributes have the greatest influence on a football player's market value. The analysis includes:

    Data Loading
    Data Cleaning
    Handling Missing Values
    Outlier Detection
    Feature Preparation
    Exploratory Data Analysis (EDA)
    Visualizations
    Insights & Conclusions

    This document summarizes the full workflow and the analytical… See the full description on the dataset page: https://huggingface.co/datasets/Nevoreuven/nevo-reuven_fifa23-player-analysis.
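
    A small pandas sketch of the EDA step described above: ranking numeric attributes by their correlation with market value. The file and column names (value_eur in particular) are assumptions based on the public FIFA 23 dataset, not confirmed by this page.

    ```python
    # Hypothetical file and column names; adjust to the actual schema.
    import pandas as pd

    players = pd.read_csv("fifa23_players.csv")
    players = players.dropna(subset=["value_eur"])         # drop rows missing the target

    corr = (players.select_dtypes("number")                # keep numeric attributes only
                   .corr()["value_eur"]
                   .drop("value_eur")
                   .sort_values(ascending=False))
    print(corr.head(10))   # attributes most associated with market value
    ```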

  8. SEM regression for H1-5.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). SEM regression for H1-5. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t004
    Explore at:
    xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  9. Feature contributions and top-three feature interactions (MFIs).

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). Feature contributions and top-three feature interactions (MFIs). [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t003
    Explore at:
    xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Feature contributions and top-three feature interactions (MFIs).

  10. Smartphones Dataset (August 2024)

    • kaggle.com
    zip
    Updated Aug 24, 2024
    Cite
    Dilkush Singh (2024). Smartphones Dataset (August 2024) [Dataset]. https://www.kaggle.com/datasets/dilkushsingh/smartphones-dataset-upto-july24
    Explore at:
    zip (605033 bytes)
    Dataset updated
    Aug 24, 2024
    Authors
    Dilkush Singh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Smartphones Dataset (August 2024)

    This dataset contains information on the latest smartphones as of July 2024, gathered through web scraping using Selenium and Beautiful Soup. The dataset is available in four different versions, reflecting the stages of data cleaning and processing.
    - If you want to know about the web scraping process, read the Medium Article blog post.
    - If you want to see the step-by-step Data Cleaning and EDA process, check out my GitHub Repo.

    Dataset Versions:

    Version 1: Raw Data (smartphones.csv or smartphones_uncleaned.csv - same files)

    This version contains the fully uncleaned data as it was initially scraped from the web. It includes all the raw information, with inconsistencies, missing values, and potential duplicates. Purpose: Serves as the baseline dataset for understanding the initial state of the data before any cleaning or processing.

    Version 2: Basic Cleaning (smartphones_cleaned_v1.csv)

    Basic cleaning operations have been applied. This includes removing duplicates, handling missing values, and standardizing the formats of certain fields (e.g., dates, numerical values). Purpose: Provides a cleaner and more consistent dataset, making it easier for basic analysis.

    Version 3: Intermediate Cleaning (smartphones_cleaned_v2.csv)

    Additional data cleaning techniques have been implemented. This version addresses more complex issues such as outlier detection and correction, normalization of categorical data, and initial feature engineering. Purpose: Offers a more refined dataset suitable for exploratory data analysis (EDA) and more in-depth statistical analyses.

    Version 4: Fully Cleaned and Processed Data (smartphones_cleaned_v3.csv)

    This version represents the final, fully cleaned dataset. Advanced cleaning techniques have been applied, including imputation of missing data, removal of irrelevant features, and final feature engineering. Purpose: Ideal for machine learning model training and other advanced analytics.

  11. Higgs bosons and a background process

    • kaggle.com
    zip
    Updated Jan 16, 2021
    Cite
    PAVAN KUMAR D (2021). Higgs bosons and a background process [Dataset]. https://www.kaggle.com/mragpavank/higs-bonsons-and-background-process
    Explore at:
    zip (11985839 bytes)
    Dataset updated
    Jan 16, 2021
    Authors
    PAVAN KUMAR D
    Description

    Dataset

    This dataset was created by PAVAN KUMAR D


  12. GEO_Processing_Exploratory_DGE_Analysis

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). GEO_Processing_Exploratory_DGE_Analysis [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/geo-processing-exploratory-dge-analysis
    Explore at:
    zip (13026816 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides a comprehensive workflow for differential gene expression (DGE) analysis.

    It focuses on processing and analyzing GEO (Gene Expression Omnibus) datasets.

    The dataset includes code for retrieving GEO datasets directly from NCBI GEO.

    It provides data cleaning, normalization, and pre-processing steps for gene expression data.

    The workflow demonstrates exploratory data analysis (EDA) on gene expression datasets.

    Differential expression analysis is performed to identify significantly expressed genes.

    Includes visualizations such as heatmaps, volcano plots, and PCA for insights.

    Designed for researchers and bioinformaticians interested in gene expression analysis.

    Supports reproducibility and can be adapted to different GEO datasets.

    Uses the Python programming language and popular data analysis libraries such as pandas, NumPy, and matplotlib.

    Encourages integration with downstream functional enrichment and pathway analysis.
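
    As a hedged illustration of one possible differential-expression step consistent with the workflow above (not the dataset's included code), the sketch below computes a log2 fold change and a per-gene t-test between two hypothetical sample groups; real GEO analyses typically use limma- or DESeq2-style models.

    ```python
    # Illustrative only: simple log2 fold change + t-test per gene.
    # File name and the "control"/"treated" column naming are assumptions.
    import numpy as np
    import pandas as pd
    from scipy import stats

    expr = pd.read_csv("geo_expression_matrix.csv", index_col=0)   # genes x samples
    control = expr.filter(like="control")
    treated = expr.filter(like="treated")

    log2fc = np.log2(treated.mean(axis=1) + 1) - np.log2(control.mean(axis=1) + 1)
    pvals = stats.ttest_ind(treated, control, axis=1).pvalue

    results = pd.DataFrame({"log2fc": log2fc, "pval": pvals}, index=expr.index)
    print(results.sort_values("pval").head())   # candidate differentially expressed genes
    ```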

  13. 99 Little Orange, Technical Business Case

    • kaggle.com
    zip
    Updated Jun 13, 2022
    Cite
    IVAN CHAVEZ (2022). 99 Little Orange, Technical Business Case [Dataset]. https://www.kaggle.com/datasets/ivanchvez/99littleorange
    Explore at:
    zip (91998345 bytes)
    Dataset updated
    Jun 13, 2022
    Authors
    IVAN CHAVEZ
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    99 Little Orange, Technical Business Case

    Dear candidate, we are so excited about your interest in working with us! This challenge is an opportunity for us to get to know a bit of the great talent we know you have. It was built to simulate real-case scenarios that you would face while working at [Organization] and is organized in 2 parts:

      1. A technical part of close-ended questions with specific answers that are meant to assess your ability to analyze large amounts of data with SQL to answer key questions.
      2. An analytical part of open-ended questions to assess your ability to build data-backed recommendations to support decision-making. Expect further questions and discussions on top of your answers in the next phase of our hiring process.

    Part I - Technical. Provide both the answer and the SQL code used.
    1. What is the average trip cost of holidays? How does it compare to non-holidays?
    2. Find the average call time of the first trip that passengers make.
    3. Find the average number of trips per driver for every weekday.
    4. Which day of the week do drivers usually drive the most distance on average?
    5. What was the growth percentage of rides month over month?
    6. Optional. List the top 5 drivers by number of trips in the top 5 largest cities.

    Part II - Analytical. 99 is a marketplace where drivers are the supply and passengers the demand. One of our main challenges is to keep this marketplace balanced. If there is too much demand, prices would increase due to surge and passengers would prefer not to ride. If there is too much supply, drivers would spend more time idle, impacting their revenue.
    1. Let's say it's 2019-09-23 and a new Operations manager for The Shire was just hired. She has 5 minutes during the Ops weekly meeting to present an overview of the business in the city, and since she's just arrived, she asked your help to do it. What would you prepare for this 5 minute presentation? Please provide 1-2 slides with your idea.
    2. She also mentioned she has a budget to invest in promoting the business. What kind of metrics and performance indicators would you use in order to help her decide whether she should invest it into the passenger side or the driver side? Extra point if you provide data-backed recommendations.
    3. One month later, she comes back, super grateful for all the helpful insights you have given her, and says she is anticipating a driver supply shortage due to a major concert that is going to take place the next day and also a 3-day city holiday that is coming the next month. What would you do to help her analyze the best course of action to either prevent or minimize the problem in each case?
    4. Optional. We want to build a model to predict “Possible Churn Users” (e.g., no trips in the past 4 weeks). List all features that you can think about and the data mining or machine learning model or other methods you may use for this case.
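
    The challenge itself asks for SQL, but as a rough pandas illustration of Part I, question 1 (average trip cost on holidays vs. non-holidays), with a hypothetical trips.csv containing trip_cost and is_holiday columns:

    ```python
    # Illustrative pandas equivalent of Part I, question 1; the schema is assumed.
    import pandas as pd

    trips = pd.read_csv("trips.csv")
    print(trips.groupby("is_holiday")["trip_cost"].mean())   # one average per group
    ```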

  14. Facebook User Engagement Data (29 chars)

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Saiful Islam Rafi (2025). Facebook User Engagement Data (29 chars) [Dataset]. https://www.kaggle.com/datasets/saifulislamrafixyz/facebook-user-engagement-data-29-chars
    Explore at:
    zip (485217 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Saiful Islam Rafi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comprehensive Facebook user dataset containing 20 features including user demographics (age, gender, country), account information (verification status, account type), and engagement metrics (likes, comments, shares, posts). The dataset includes realistic data quality issues such as missing values (NaN), duplicates, outliers, typos, inconsistent formats, impossible values, and mixed data types. Ideal for practicing data cleaning, exploratory data analysis (EDA), feature engineering, and data preprocessing workflows.
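
    A short, hypothetical pandas sketch of a first cleaning pass over the issues listed above; column names such as age, gender, and likes are assumptions about the schema.

    ```python
    # Hedged sketch of basic cleaning: duplicates, mixed types, impossible values,
    # inconsistent formats, and simple imputation.
    import pandas as pd

    df = pd.read_csv("facebook_users.csv")

    df = df.drop_duplicates()
    df["age"] = pd.to_numeric(df["age"], errors="coerce")       # mixed types -> NaN
    df = df[df["age"].between(13, 100) | df["age"].isna()]      # drop impossible ages
    df["gender"] = df["gender"].str.strip().str.lower()         # normalize formats/typos
    df["likes"] = df["likes"].fillna(df["likes"].median())      # simple imputation

    print(df.isna().sum())   # remaining missing values per column
    ```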

  15. Weights of variables in the indicator system.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Dec 16, 2024
    Cite
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen (2024). Weights of variables in the indicator system. [Dataset]. http://doi.org/10.1371/journal.pone.0314242.t001
    Explore at:
    xls
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The article examines the spatial distribution characteristics and influencing factors of traditional Tibetan “Bengke” residential architecture in Luhuo County, Ganzi Tibetan Autonomous Prefecture, Sichuan Province. The study utilizes spatial statistical methods, including Average Nearest Neighbor Analysis, Getis-Ord Gi*, and Kernel Density Estimation, to identify significant clustering patterns of Bengke architecture. Spatial autocorrelation was tested using Moran’s Index, with results indicating no significant spatial autocorrelation, suggesting that the distribution mechanisms are complex and influenced by multiple factors. Additionally, exploratory data analysis (EDA), the Analytic Hierarchy Process (AHP), and regression methods such as Lasso and Elastic Net were used to identify and validate key factors influencing the distribution of these buildings. The analysis reveals that road density, population density, economic development quality, and industrial structure are the most significant factors. The study also highlights that these factors vary in impact between high-density and low-density areas, depending on the regional environment. These findings offer a comprehensive understanding of the spatial patterns of Bengke architecture and provide valuable insights for the preservation and sustainable development of this cultural heritage.
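
    As a rough sketch of the Lasso and Elastic Net step described above (not the authors' actual code; the indicator file and the target column are assumptions):

    ```python
    # Illustrative only: rank candidate factors with Lasso / Elastic Net.
    # "bengke_indicators.csv" and the target column "bengke_density" are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LassoCV, ElasticNetCV
    from sklearn.preprocessing import StandardScaler

    data = pd.read_csv("bengke_indicators.csv")
    y = data["bengke_density"]
    X = StandardScaler().fit_transform(data.drop(columns="bengke_density"))

    lasso = LassoCV(cv=5).fit(X, y)
    enet = ElasticNetCV(cv=5).fit(X, y)

    coefs = pd.DataFrame({"lasso": lasso.coef_, "elastic_net": enet.coef_},
                         index=data.columns.drop("bengke_density"))
    print(coefs.reindex(coefs["lasso"].abs().sort_values(ascending=False).index))
    ```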

  16. Factors influencing high and low aggregation areas.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Dec 16, 2024
    Cite
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen (2024). Factors influencing high and low aggregation areas. [Dataset]. http://doi.org/10.1371/journal.pone.0314242.t002
    Explore at:
    xls
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Factors influencing high and low aggregation areas.

  17. Bellabeat Case Study Outline

    • kaggle.com
    zip
    Updated Jun 25, 2024
    Cite
    Sydney Yauney (2024). Bellabeat Case Study Outline [Dataset]. https://www.kaggle.com/datasets/sydneylynnyoung/bellabeat-case-study-clean-data/data
    Explore at:
    zip (5365 bytes)
    Dataset updated
    Jun 25, 2024
    Authors
    Sydney Yauney
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    In this project, I was able to provide valuable insight to the Bellabeat marketing team through the process of cleaning and analyzing public smartwatch data in order to find smartwatch usage trends.

    I followed all six of the data analysis steps in this project: Ask, Prepare, Process, Analyze, Share, and Act. This involved focusing on the business task, searching for credible data, cleaning and analyzing the data, creating simple and effective data visualizations, and coming up with a final recommendation and presentation for stakeholders. The tools I chose to use were BigQuery and Tableau; I chose these due to the large size of the data.

    The data attached is the result of the combination of 4 public datasets, found on Kaggle. All 4 datasets contain data from Fitbit users. The data was combined and cleaned, and then a second table was created, after parsing the weekday from the dates of the original data.

    Below are some data vizualizations created to represent trends in this data.

    Most Popular Time to Be Active (chart)

    Most Popular Day to Be Active (chart)

    After finding trends showing that smartwatch users are most active on the weekends and during the evenings, my solution was to create a marketing campaign geared towards women who work a 9-5 sedentary-style job. The goal of the project, which was to provide valuable insight based on public data, was accomplished through this recommendation. I cleaned and analyzed public Fitbit user data, identified trends, and provided an insightful recommendation for a new focus for the marketing team.

  18. Titanic Dataset

    • kaggle.com
    zip
    Updated Sep 29, 2025
    Cite
    Prince Rajak (2025). Titanic Dataset [Dataset]. https://www.kaggle.com/datasets/prince7489/titanic-dataset
    Explore at:
    zip (1849548 bytes)
    Dataset updated
    Sep 29, 2025
    Authors
    Prince Rajak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Titanic Survival Prediction Project explores one of the most iconic datasets in data science. The goal is to predict whether a passenger survived the Titanic disaster based on key attributes such as age, gender, ticket class, family size, and fare.

    Using a dataset of 100,000 synthetic records inspired by the original Titanic data, this project demonstrates a complete data science workflow — including data cleaning, exploratory data analysis (EDA), feature engineering, and predictive modeling.

    By analyzing patterns (e.g., higher survival rates among women, children, and first-class passengers), the project showcases how machine learning can uncover meaningful insights from historical events.
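
    A compact, hypothetical sketch of the modeling step on such a file; the column names follow the classic Titanic schema (Survived, Pclass, Sex, Age, Fare) and are an assumption about this synthetic version.

    ```python
    # Illustrative baseline model; not the project's actual notebook.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("titanic.csv")
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())

    X = df[["Pclass", "Sex", "Age", "Fare"]]
    y = df["Survived"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")
    ```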

  19. Demographics of patients.

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xls
    Updated Jun 1, 2023
    Cite
    Prathiba Natesan; Dima Hadid; Yara Abou Harb; Eveline Hitti (2023). Demographics of patients. [Dataset]. http://doi.org/10.1371/journal.pone.0221087.t001
    Explore at:
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Prathiba Natesan; Dima Hadid; Yara Abou Harb; Eveline Hitti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Demographics of patients.

  20. Factor/pattern coefficients from exploratory factor analysis (EFA) and...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Prathiba Natesan; Dima Hadid; Yara Abou Harb; Eveline Hitti (2023). Factor/pattern coefficients from exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). [Dataset]. http://doi.org/10.1371/journal.pone.0221087.t002
    Explore at:
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Prathiba Natesan; Dima Hadid; Yara Abou Harb; Eveline Hitti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Factor/pattern coefficients from exploratory factor analysis (EFA) and confirmatory factor analysis (CFA).
