100+ datasets found
  1. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis comprises a set of statistical and data mining procedures used to describe data. We ran an EDA to provide statistical facts and inform conclusions; the mined facts support the arguments that shape the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. These hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, found in the repository. These features were manually engineered from the DL4SE papers; some of them are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, and SE data, among others.

    Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract information left missing by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation except for the clustering analysis, where we performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models (see the sketch after the stage list).

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We oriented the goal of the KDD process toward uncovering hidden relationships among the extracted features (Correlations and Association Rules) and categorizing the DL4SE papers for a better segmentation of the state of the art (Clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produces an argument support analysis (see this link).
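
    As a concrete illustration of the Transformation stage above, the sketch below shows how a 35-feature matrix could be reduced to 2 principal components and how a cluster count could be chosen by looking at the drop in within-cluster variance. It is a minimal sketch on a made-up feature matrix; the actual pipeline was built in RapidMiner, and the variable names here are assumptions.

    # Minimal sketch of the Transformation stage: PCA to 2 components plus an
    # elbow-style search for the cluster count. `X` is a hypothetical encoded
    # papers-by-features matrix, not the authors' RapidMiner pipeline.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(128, 35)).astype(float)  # stand-in for the 35 nominal features

    # Project the 35 features onto 2 components for visualization.
    X_2d = PCA(n_components=2).fit_transform(X)

    # Within-cluster variance (inertia) for several cluster counts; pick the k
    # where the reduction in variance levels off.
    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d)
        print(f"k={k}: within-cluster variance = {km.inertia_:.1f}")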

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that, given the premise, the conclusion is associated. For example, given that an author used Supervised Learning, we can conclude, with a certain Support and Confidence, that their approach is irreproducible.

    Support = the number of occurrences in which the statement is true, divided by the total number of statements. Confidence = the support of the statement divided by the number of occurrences of the premise.
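
    To make these definitions concrete, the short sketch below computes Support and Confidence for a hypothetical rule “Supervised Learning → irreproducible” over a small made-up set of papers; the counts are illustrative only and do not come from the SLR data.

    # Illustrative Support/Confidence computation for the hypothetical rule
    # "Supervised Learning -> irreproducible" on made-up data.
    papers = [
        {"learning": "supervised", "irreproducible": True},
        {"learning": "supervised", "irreproducible": True},
        {"learning": "supervised", "irreproducible": False},
        {"learning": "unsupervised", "irreproducible": False},
        {"learning": "reinforcement", "irreproducible": True},
    ]

    n = len(papers)
    premise = sum(p["learning"] == "supervised" for p in papers)
    both = sum(p["learning"] == "supervised" and p["irreproducible"] for p in papers)

    support = both / n           # occurrences where the full statement holds / all statements
    confidence = both / premise  # equivalently, support of the statement / support of the premise

    print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.40 and 0.67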

  2. Google Certificate BellaBeats Capstone Project

    • kaggle.com
    zip
    Updated Jan 5, 2023
    Cite
    Jason Porzelius (2023). Google Certificate BellaBeats Capstone Project [Dataset]. https://www.kaggle.com/datasets/jasonporzelius/google-certificate-bellabeats-capstone-project
    Explore at:
    zip (169161 bytes)
    Dataset updated
    Jan 5, 2023
    Authors
    Jason Porzelius
    Description

    Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally installed program, Excel, for both my data analysis and visualizations. This choice was made primarily because I live in a remote area with limited bandwidth and inconsistent internet access, so completing a capstone project using web-based programs such as RStudio, SQL Workbench, or Google Sheets was not feasible. I was further limited in which option to choose because the datasets for the ride-share project option were larger than my version of Excel would accept.

    In the scenario provided, I will be acting as a Junior Data Analyst in support of the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in the hope that the findings from that dataset might reveal insights to assist Bellabeats' marketing strategies for future growth. My task is to provide data-driven insights for the business tasks set by the Bellabeats, Inc. executive and data analysis team.

    To accomplish this task, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). I will break each part of the process down into three sections to provide clarity and accountability: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I will record the deliverables for each Key Task directly under the numbered Key Task, using an asterisk (*) as an identifier.

    Section 1 - Ask:

    A. Guiding Questions:
    1. Who are the key stakeholders and what are their goals for the data analysis project?
    2. What is the business task that this data analysis project is attempting to solve?

    B. Key Tasks:
    1. Identify key stakeholders and their goals for the data analysis project.
    *The key stakeholders for this project are as follows:
    -Urška Sršen and Sando Mur - co-founders of Bellabeats, Inc.
    -Bellabeats marketing analytics team. I am a member of this team.

    2. Identify the business task.
    *The business task, as provided by co-founder Urška Sršen, is to gain insight into how consumers are using their non-BellaBeats smart devices in order to guide upcoming marketing strategies that will help drive future growth for the company. Specifically, the researcher was tasked with applying insights derived from the data analysis process to one BellaBeats product and presenting those insights to BellaBeats stakeholders.

    Section 2 - Prepare:

    A. Guiding Questions:
    1. Where is the data stored and organized?
    2. Are there any problems with the data?
    3. How does the data help answer the business question?

    B. Key Tasks:

    1. Research and communicate the source of the data, and how it is stored/organized, to stakeholders.
    *The data source used for our case study is FitBit Fitness Tracker Data. This dataset is stored in Kaggle and was made available by user Mobius in an open-source format. Therefore, the data is public and may be copied, modified, and distributed without asking the user for permission. These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk, reportedly (see credibility section directly below) between 03/12/2016 and 05/12/2016.
    *Reportedly (see credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute, hour, and day level totals and is stored in 18 CSV documents. I downloaded all 18 documents to my local laptop and decided to use 2 of them for this project, as they contained activity and sleep data merged from the other documents. All unused documents were permanently deleted from the laptop. The 2 files used were:
    -sleepDay_merged.csv
    -dailyActivity_merged.csv

    2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias. *As will be more specifically presented in the Process section, the data seems to have credibility issues related to the reported time frame of the data collected. The metadata seems to indicate that the data collected covered roughly 2 months of FitBit tracking. However, upon my initial data processing, I found that only 1 month of data was reported. *As will be more specifically presented in the Process section, the data has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata communicates that 30 individual users agreed to report their tracking data. My initial data processing uncovered 33 individual ...

  3. Dataset from New Data Analysis Methods for Actigraphy in Sleep Medicine

    • data-staging.niaid.nih.gov
    Updated Feb 6, 2025
    Cite
    BioLINCC (a data-sharing platform funded by the National Institutes of Health); William Shannon, PhD (2025). Dataset from New Data Analysis Methods for Actigraphy in Sleep Medicine [Dataset]. http://doi.org/10.25934/00005414
    Explore at:
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    Washington University School of Medicine
    Authors
    BioLINCC (a data-sharing platform funded by the National Institutes of Health); William Shannon, PhD
    Area covered
    United States
    Variables measured
    Apnea, Assessment Of Sleep Pattern
    Description

    The purpose of this study is to develop statistical and informatics tools for analyzing and visualizing Actical™ (actigraphy) data linked to fatigue in Sleep Medicine Center patients.

  4. Replication Data for: Upcoming issues, new methods: using Interactive...

    • data.mendeley.com
    Updated Oct 18, 2021
    + more versions
    Cite
    Gustavo Behling (2021). Replication Data for: Upcoming issues, new methods: using Interactive Qualitative Analysis (IQA) in Management Research [Dataset]. http://doi.org/10.17632/kb76h5jtvw.1
    Explore at:
    Dataset updated
    Oct 18, 2021
    Authors
    Gustavo Behling
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These data refer to the paper “Upcoming issues, new methods: using Interactive Qualitative Analysis (IQA) in Management Research”. The article is a guide to the application of the IQA method in management research, and the available files are:
    1. 1-Affinities, definitions, and cards produced by focus group.docx: all cards, affinities, and definitions created by the focus group session.
    2. 2-Step-by-step - Analysis procedures.docx: detailed data analysis procedures.
    3. 3-Axial Coding Tables – Individual Interviews.docx: detailed axial coding procedures.
    4. 4-Theoretical Coding Table – Individual Interviews.docx: detailed theoretical coding procedures.

  5. Water Rights Demand Analysis Methodology Datasets

    • data.cnra.ca.gov
    • data.ca.gov
    • +2more
    csv, xlsx
    Updated Apr 7, 2022
    Cite
    California State Water Resources Control Board (2022). Water Rights Demand Analysis Methodology Datasets [Dataset]. https://data.cnra.ca.gov/dataset/water-rights-demand-analysis-methodology-datasets
    Explore at:
    csv, xlsx
    Dataset updated
    Apr 7, 2022
    Dataset authored and provided by
    California State Water Resources Control Board
    License

    U.S. Government Works, https://www.usa.gov/government-works
    License information was derived automatically

    Description

    The following datasets are used for the Water Rights Demand Analysis project and are formatted to be used in the calculations. The State Water Resources Control Board Division of Water Rights (Division) has developed a methodology to standardize and improve the accuracy of water diversion and use data that is used to determine water availability and inform water management and regulatory decisions. The Water Rights Demand Data Analysis Methodology (Methodology https://www.waterboards.ca.gov/drought/drought_tools_methods/demandanalysis.html ) is a series of data pre-processing steps, R Scripts, and data processing modules that identify and help address data quality issues related to both the self-reported water diversion and use data from water right holders or their agents and the Division of Water Rights electronic water rights data.

  6. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Explore at:
    pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
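
    As a small illustration of the strategies described above, the sketch below applies listwise deletion, mean imputation, and regression-based imputation to a toy table. The column names and values are hypothetical and are not drawn from the document.

    # Toy comparison of deletion, mean imputation, and regression imputation.
    # Column names and values are hypothetical, for illustration only.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        "age":    [23, 35, 41, 52, 29, 60],
        "income": [31000, 52000, np.nan, 78000, np.nan, 90000],
    })

    # 1. Listwise deletion: drop rows with any missing value.
    deleted = df.dropna()

    # 2. Mean imputation: replace missing income with the observed mean.
    mean_imputed = df.assign(income=df["income"].fillna(df["income"].mean()))

    # 3. Regression imputation: predict missing income from age using the observed
    #    rows (adding random noise to the predictions would make it stochastic).
    observed = df.dropna()
    model = LinearRegression().fit(observed[["age"]], observed["income"])
    reg_imputed = df.copy()
    missing = reg_imputed["income"].isna()
    reg_imputed.loc[missing, "income"] = model.predict(reg_imputed.loc[missing, ["age"]])

    print(deleted, mean_imputed, reg_imputed, sep="\n\n")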

  7. Examples of boilerplate text from PLOS ONE papers based on targeted n-gram...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xls
    Updated Jun 14, 2023
    Cite
    Nicole M. White; Thirunavukarasu Balasubramaniam; Richi Nayak; Adrian G. Barnett (2023). Examples of boilerplate text from PLOS ONE papers based on targeted n-gram searches (sentence level). [Dataset]. http://doi.org/10.1371/journal.pone.0264360.t001
    Explore at:
    xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Nicole M. White; Thirunavukarasu Balasubramaniam; Richi Nayak; Adrian G. Barnett
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Examples of boilerplate text from PLOS ONE papers based on targeted n-gram searches (sentence level).

  8. Google Data Analytics Capstone

    • kaggle.com
    zip
    Updated Aug 9, 2022
    Cite
    Reilly McCarthy (2022). Google Data Analytics Capstone [Dataset]. https://www.kaggle.com/datasets/reillymccarthy/google-data-analytics-capstone/discussion
    Explore at:
    zip (67456 bytes)
    Dataset updated
    Aug 9, 2022
    Authors
    Reilly McCarthy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Hello! Welcome to the Capstone project I completed to earn my Data Analytics certificate through Google. I chose to complete this case study in RStudio Desktop because R is the primary new concept I learned throughout this course, and I wanted to embrace my curiosity and learn more about R through this project. At the beginning of this report I will provide the scenario of the case study I was given. After that I will walk you through my Data Analysis process based on the steps I learned in this course:

    1. Ask
    2. Prepare
    3. Process
    4. Analyze
    5. Share
    6. Act

    The data I used for this analysis comes from this FitBit data set: https://www.kaggle.com/datasets/arashnic/fitbit

    " This dataset generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. "

  9. Data from: Data files used to study change dynamics in software systems

    • figshare.swinburne.edu.au
    pdf
    Updated Jul 22, 2024
    Cite
    Rajesh Vasa (2024). Data files used to study change dynamics in software systems [Dataset]. http://doi.org/10.25916/sut.26288227.v1
    Explore at:
    pdf
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Swinburne
    Authors
    Rajesh Vasa
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    It is a widely accepted fact that evolving software systems change and grow. However, it is less well understood how change is distributed over time, specifically in object-oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place, as well as to inform them of the longer-term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, as well as providing useful information as input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. In order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes. Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes, and that classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution-prone parts of a system, as well as support effort estimation activities.

    The specific research questions that we address in this chapter are: (1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general? (2) How is modification frequency distributed for classes that change? (3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications? (4) Does structural complexity make a class susceptible to change? (5) Does popularity make a class more change-prone?

    We make recommendations that can help developers to proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2MB in total) is provided as comma separated values (CSV), and the first line of each file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).
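
    Research questions (1) and (2) above lend themselves to a simple empirical check. The sketch below estimates the per-version likelihood of a class changing and the modification-frequency distribution from a long-format table with columns version, class, and changed; the file name and schema are assumptions for illustration and do not match the provided metric files.

    # Hedged sketch: estimate P(class changes from one version to the next) from a
    # hypothetical long-format table with columns: version, class, changed (0/1).
    # The file name and schema are assumed for illustration only.
    import pandas as pd

    df = pd.read_csv("class_changes.csv")  # hypothetical file

    # Per-version probability that a class was modified (research question 1).
    per_version = df.groupby("version")["changed"].mean()
    print("P(change) by version:\n", per_version)

    # Modification-frequency distribution for classes that changed at least once
    # (research question 2).
    freq = df[df["changed"] == 1].groupby("class").size()
    print("\nModification counts per changed class:\n", freq.describe())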

  10. Dataset of books called Statistical and computational methods in data...

    • workwithdata.com
    Updated Apr 17, 2025
    + more versions
    Cite
    Work With Data (2025). Dataset of books called Statistical and computational methods in data analysis [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Statistical+and+computational+methods+in+data+analysis
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 2 rows and is filtered where the book is Statistical and computational methods in data analysis. It features 7 columns including author, publication date, language, and book publisher.

  11. Appendix C. Summary of data analysis and randomization procedure.

    • datasetcatalog.nlm.nih.gov
    • wiley.figshare.com
    Updated Aug 4, 2016
    Cite
    Villéger, Sébastien; Hernández, Domingo Flores; Miranda, Julia Ramos; Mouillot, David (2016). Appendix C. Summary of data analysis and randomization procedure. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001525567
    Explore at:
    Dataset updated
    Aug 4, 2016
    Authors
    Villéger, Sébastien; Hernández, Domingo Flores; Miranda, Julia Ramos; Mouillot, David
    Description

    Summary of data analysis and randomization procedure.

  12. Data from: A simple method for statistical analysis of intensity differences...

    • catalog.data.gov
    • healthdata.gov
    • +1more
    Updated Sep 7, 2025
    Cite
    National Institutes of Health (2025). A simple method for statistical analysis of intensity differences in microarray-derived gene expression data [Dataset]. https://catalog.data.gov/dataset/a-simple-method-for-statistical-analysis-of-intensity-differences-in-microarray-derived-ge
    Explore at:
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background
    Microarray experiments offer a potent solution to the problem of making and comparing large numbers of gene expression measurements, either in different cell types or in the same cell type under different conditions. Inferences about the biological relevance of observed changes in expression depend on the statistical significance of the changes. In lieu of many replicates with which to determine accurate intensity means and variances, reliable estimates of statistical significance remain problematic. Without such estimates, overly conservative choices for significance must be enforced.

    Results
    A simple statistical method for estimating variances from microarray control data which does not require multiple replicates is presented. Comparison of datasets from two commercial entities using this difference-averaging method demonstrates that the standard deviation of the signal scales at a level intermediate between the signal intensity and its square root. Application of the method to a dataset related to the β-catenin pathway yields a larger number of biologically reasonable genes whose expression is altered than the ratio method.

    Conclusions
    The difference-averaging method enables determination of variances as a function of signal intensities by averaging over the entire dataset. The method also provides a platform-independent view of important statistical properties of microarray data.
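
    The Results paragraph implies that signal variance can be estimated as a function of intensity by pooling information across the whole dataset rather than across replicates. The sketch below illustrates that general idea on simulated control-probe data: probes are binned by mean intensity and the standard deviation of paired differences is reported per bin. It is only an illustration under assumed data, not the published difference-averaging algorithm.

    # Illustration: estimate intensity-dependent variability without replicates by
    # binning control probes by mean intensity and taking the SD of paired
    # differences per bin. Data are simulated; this is not the published algorithm.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    true_intensity = rng.uniform(50, 5000, size=2000)
    noise_sd = true_intensity ** 0.75  # SD between sqrt(I) and I, as the abstract reports
    a = true_intensity + rng.normal(0, noise_sd)
    b = true_intensity + rng.normal(0, noise_sd)

    probes = pd.DataFrame({"mean": (a + b) / 2, "diff": a - b})
    probes["bin"] = pd.qcut(probes["mean"], q=10)

    # SD(diff) within a bin is roughly sqrt(2) times the signal SD at that intensity.
    print(probes.groupby("bin", observed=True)["diff"].std())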

  13. Data from: Meta-analysis: Neither quick nor easy

    • data.virginia.gov
    • catalog.data.gov
    html
    Updated Sep 6, 2025
    Cite
    National Institutes of Health (2025). Meta-analysis: Neither quick nor easy [Dataset]. https://data.virginia.gov/dataset/meta-analysis-neither-quick-nor-easy
    Explore at:
    html
    Dataset updated
    Sep 6, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background
    Meta-analysis is often considered to be a simple way to summarize the existing literature. In this paper we describe how a meta-analysis resembles a conventional study, requiring a written protocol with design elements that parallel those of a record review.

    Methods
    The paper provides a structure for creating a meta-analysis protocol. Some guidelines for measurement of the quality of papers are given. A brief overview of statistical considerations is included. Four papers are reviewed as examples. The examples generally followed the guidelines we specify in reporting the studies and results, but in some of the papers there was insufficient information on the meta-analysis process.

    Conclusions
    Meta-analysis can be a very useful method to summarize data across many studies, but it requires careful thought, planning and implementation.

  14. Software tools used for data collection and analysis.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    John A. Borghi; Ana E. Van Gulick (2023). Software tools used for data collection and analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0252047.t003
    Explore at:
    xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    John A. Borghi; Ana E. Van Gulick
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software tools used to collect and analyze data. Parentheses for analysis software indicate the tools participants were taught to use as part of their education in research methods and statistics. “Other” responses for data collection software were largely comprised of survey tools (e.g. Survey Monkey, LimeSurvey) and tools for building and running behavioral experiments (e.g. Gorilla, JsPsych). “Other” responses for data analysis software largely consisted of neuroimaging-related tools (e.g. SPM, AFNI).

  15. Digital Data Analytics, Public Engagement and the Social Life of Methods

    • orda.shef.ac.uk
    docx
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Helen Kennedy; Giles Moss; Stylianos Moshanas; Chris Birchall (2023). Digital Data Analytics, Public Engagement and the Social Life of Methods [Dataset]. http://doi.org/10.15131/shef.data.5194993.v1
    Explore at:
    docx
    Dataset updated
    May 30, 2023
    Dataset provided by
    The University of Sheffield
    Authors
    Helen Kennedy; Giles Moss; Stylianos Moshanas; Chris Birchall
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Interview and workshop transcripts from EPSRC Digital Transformations Communities and Cultures Network + (http://www.communitiesandculture.org/) project Digital Data Analytics, Public Engagement and the Social Life of Methods (http://www.communitiesandculture.org/projects/digital-data-analysis/). Methodology described in papers available at the above link.

  16. Sensitivity Analysis Tools for Clinical Trials with Missing Data [Methods...

    • icpsr.umich.edu
    Updated Sep 15, 2025
    Cite
    Scharfstein, Daniel O. (2025). Sensitivity Analysis Tools for Clinical Trials with Missing Data [Methods Study], 2013-2018 [Dataset]. http://doi.org/10.3886/ICPSR39492.v1
    Explore at:
    Dataset updated
    Sep 15, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Scharfstein, Daniel O.
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39492/terms

    Time period covered
    2013 - 2018
    Area covered
    United States
    Description

    Clinical trials study the effects of medical treatments, like how safe they are and how well they work. But most clinical trials don't get all the data they need from patients. Patients may not answer all questions on a survey, or they may drop out of a study after it has started. The missing data can affect researchers' ability to detect the effects of treatments. To address the problem of missing data, researchers can make different guesses based on why and how data are missing. Then they can look at results for each guess. If results based on different guesses are similar, researchers can have more confidence that the study results are accurate. In this study, the research team created new methods to do these tests and developed software that runs these tests. To access the sensitivity analysis methods and software, please visit the MissingDataMatters website.

  17. Coding and Categorization of Data Collection and Analysis Methods

    • scidb.cn
    Updated Sep 9, 2025
    Cite
    Liu Ye (2025). Coding and Categorization of Data Collection and Analysis Methods [Dataset]. http://doi.org/10.57760/sciencedb.27679
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 9, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Liu Ye
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Examples of Coding and Inductive Classification of Data Collection and Analysis Methods

  18. cases study1 example for google data analytics

    • kaggle.com
    zip
    Updated Apr 22, 2023
    Cite
    mohammed hatem (2023). cases study1 example for google data analytics [Dataset]. https://www.kaggle.com/datasets/mohammedhatem/cases-study1-example-for-google-data-analytics
    Explore at:
    zip (25278847 bytes)
    Dataset updated
    Apr 22, 2023
    Authors
    mohammed hatem
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    On my journey to earn the Google Data Analytics certificate, I will practice on a real-world example by following the steps of the data analysis process: ask, prepare, process, analyze, share, and act. I am picking the Bellabeat example.

  19. Reliance on data & analysis for marketing decisions in Western Europe 2024

    • statista.com
    Updated May 15, 2024
    Cite
    Statista (2024). Reliance on data & analysis for marketing decisions in Western Europe 2024 [Dataset]. https://www.statista.com/statistics/1465527/reliance-data-analysis-marketing-decisions-europe/
    Explore at:
    Dataset updated
    May 15, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 2024
    Area covered
    Europe
    Description

    During a survey carried out in 2024, roughly one in three marketing managers from France, Germany, and the United Kingdom stated that they based every marketing decision on data. Under ** percent of respondents in all five surveyed countries said they struggled to incorporate data analytics into their decision-making process.

  20. Data from: Analytical Procedures for Determining the Impacts of Reliability...

    • catalog.data.gov
    • data.bts.gov
    • +1more
    Updated Dec 7, 2023
    + more versions
    Cite
    Federal Highway Administration (2023). Analytical Procedures for Determining the Impacts of Reliability Mitigation Strategies [supporting datasets] [Dataset]. https://catalog.data.gov/dataset/analytical-procedures-for-determining-the-impacts-of-reliability-mitigation-strategies-sup
    Explore at:
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    Federal Highway Administration (https://highways.dot.gov/)
    Description

    The objective of this project was to develop technical relationships between reliability improvement strategies and reliability performance metrics. This project defined reliability, explained the importance of travel time distributions for measuring reliability, and recommended specific reliability performance measures. The research reexamined the contribution of the various causes of nonrecurring congestion on urban freeway sections; however, some attention was also given to rural highways and urban arterials. Numerous actions that can potentially reduce nonrecurring congestion were identified, with an indication of their relative importance. Models for predicting nonrecurring congestion were developed using three methods, all based on empirical procedures: the first involved before-and-after studies; the second was termed a 'data poor' approach and resulted in a parsimonious and easy-to-apply set of models; the third was entitled a 'data rich' model and used cross-section inputs, including data on selected factors known to directly affect nonrecurring congestion. An important conclusion of the study is that actions to improve operations, reduce demand, and increase capacity can all improve travel time reliability. The 3 attached zip files contain comma separated value (.csv) files of data to support SHRP 2 report S2-L03-RR-1, Analytical Procedures for Determining the Impacts of Reliability Mitigation Strategies. Zip size is 1.83 MB. Files were accessed in Microsoft Excel 2016. Data will be preserved as is. To view the publication, see: https://rosap.ntl.bts.gov/view/dot/3605
