100+ datasets found
  1. Google Certificate BellaBeats Capstone Project

    • kaggle.com
    zip
    Updated Jan 5, 2023
    Cite
    Jason Porzelius (2023). Google Certificate BellaBeats Capstone Project [Dataset]. https://www.kaggle.com/datasets/jasonporzelius/google-certificate-bellabeats-capstone-project
    Explore at:
    Available download formats: zip (169161 bytes)
    Dataset updated
    Jan 5, 2023
    Authors
    Jason Porzelius
    Description

    Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally hosted program, Excel, for both my data analysis and visualizations. This choice was made primarily because I live in a remote area with limited bandwidth and inconsistent internet access, so completing a capstone project with web-based programs such as R Studio, SQL Workbench, or Google Sheets was not feasible. My options were further limited because the datasets for the ride-share project were larger than my version of Excel would accept. In the scenario provided, I act as a Junior Data Analyst supporting the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in the hope that its findings will reveal insights to assist Bellabeats' marketing strategies for future growth. My task is to provide data-driven insights on the business tasks set by the Bellabeats, Inc. executive and data analysis teams. To accomplish this, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). I will also break each part down into three sections to provide clarity and accountability: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I record the deliverables for each Key Task directly under the numbered Key Task, using an asterisk (*) as an identifier.

    Section 1 - Ask:

    A. Guiding Questions:
    1. Who are the key stakeholders and what are their goals for the data analysis project?
    2. What is the business task that this data analysis project is attempting to solve?

    B. Key Tasks:
    1. Identify key stakeholders and their goals for the data analysis project.
    *The key stakeholders for this project are as follows:
    -Urška Sršen and Sando Mur, co-founders of Bellabeats, Inc.
    -The Bellabeats marketing analytics team, of which I am a member.

    2. Identify the business task.
    *The business task is:
    -As provided by co-founder Urška Sršen, the business task for this project is to gain insight into how consumers use their non-BellaBeats smart devices, in order to guide upcoming marketing strategies that will help drive the company's future growth. Specifically, the researcher was tasked with applying insights from the data analysis process to one BellaBeats product and presenting those insights to BellaBeats stakeholders.

    Section 2 - Prepare:

    A. Guiding Questions: 1. Where is the data stored and organized? 2. Are there any problems with the data? 3. How does the data help answer the business question?

    B. Key Tasks:

    1. Research and communicate the source of the data, and how it is stored/organized, to stakeholders. *The data source used for our case study is FitBit Fitness Tracker Data. This dataset is stored on Kaggle and was made available by user Mobius in an open-source format. Therefore, the data is public and available to be copied, modified, and distributed, all without asking the user for permission. These datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk, reportedly (see the credibility section directly below) between 03/12/2016 and 05/12/2016.
      *Reportedly (see the credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute, hour, and day level totals. The data is stored in 18 CSV documents. I downloaded all 18 documents onto my laptop and decided to use 2 of them for this project, as they were the files that merged activity and sleep data from the other documents. All unused documents were permanently deleted from the laptop. The 2 files used were: -sleepDay_merged.csv -dailyActivity_merged.csv

    2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias. *As will be more specifically presented in the Process section, the data seems to have credibility issues related to the reported time frame of the data collected. The metadata seems to indicate that the data collected covered roughly 2 months of FitBit tracking. However, upon my initial data processing, I found that only 1 month of data was reported. *As will be more specifically presented in the Process section, the data has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata communicates that 30 individual users agreed to report their tracking data. My initial data processing uncovered 33 individual ...
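The two credibility checks described above (how many distinct users reported data, and how long a period the records cover) were done by the author in Excel. As an illustration only, the same checks can be sketched in Python. The column names `Id` and `ActivityDate` and the date format are assumptions about dailyActivity_merged.csv, and the rows below are invented stand-ins, not the real data.

```python
import csv
import io
from datetime import datetime

# Invented stand-in for dailyActivity_merged.csv.
# Column names "Id" and "ActivityDate" are assumptions for this sketch.
SAMPLE = """Id,ActivityDate,TotalSteps
1001,4/12/2016,13162
1001,4/13/2016,10735
1002,4/12/2016,8163
1003,5/12/2016,4562
"""


def summarize(csv_text):
    """Return (number of distinct users, first date, last date)."""
    users = set()
    dates = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        users.add(row["Id"])
        dates.append(datetime.strptime(row["ActivityDate"], "%m/%d/%Y").date())
    return len(users), min(dates), max(dates)


n_users, first_day, last_day = summarize(SAMPLE)
days_covered = (last_day - first_day).days + 1  # inclusive span
print(n_users, first_day, last_day, days_covered)
```

Comparing `n_users` and `days_covered` against what the metadata claims is the kind of check that surfaces a mismatch in user count or time frame.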

  2. DataSheet1_Exploratory data analysis (EDA) machine learning approaches for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. 
Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
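To make the dimensionality-reduction step concrete, here is a minimal, dependency-free sketch of PCA's first component, found by power iteration on the covariance matrix. It is not the authors' pipeline: the three-channel "spectra" are invented numbers, the function name is hypothetical, and UMAP and the clustering step are not shown.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction of greatest variance, projected scores)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the dominant eigenvector of cov,
    # provided the start vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the component (one score per sample).
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Invented data: channel 2 is always twice channel 1, channel 3 is constant.
spectra = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
component, scores = first_principal_component(spectra)
print(component)
print(scores)
```

Because the second channel is perfectly redundant with the first, a single component captures all the variance here, which is the sense in which EDA flags redundant variables.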

  3. Data Insight: Google Analytics Capstone Project

    • kaggle.com
    zip
    Updated Mar 2, 2024
    Cite
    sinderpreet (2024). Data Insight: Google Analytics Capstone Project [Dataset]. https://www.kaggle.com/datasets/sinderpreet/datainsight-google-analytics-capstone-project
    Explore at:
    Available download formats: zip (215409585 bytes)
    Dataset updated
    Mar 2, 2024
    Authors
    sinderpreet
    License

    https://cdla.io/permissive-1-0/

    Description

    Case study: How does a bike-share navigate speedy success?

    Scenario:

    As a data analyst on Cyclistic's marketing team, our focus is on enhancing annual memberships to drive the company's success. We aim to analyze the differing usage patterns between casual riders and annual members to craft a marketing strategy aimed at converting casual riders. Our recommendations, supported by data insights and professional visualizations, await Cyclistic executives' approval to proceed.

    About the company

    In 2016, Cyclistic launched a bike-share program in Chicago, growing to 5,824 bikes and 692 stations. Initially, their marketing aimed at broad segments with flexible pricing plans attracting both casual riders (single-ride or full-day passes) and annual members. However, recognizing that annual members are more profitable, Cyclistic is shifting focus to convert casual riders into annual members. To achieve this, they plan to analyze historical bike trip data to understand the differences and preferences between the two user groups, aiming to tailor marketing strategies that encourage casual riders to purchase annual memberships.
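A first comparison between the two rider groups is average ride length. The sketch below is an assumption-laden illustration, not the project's actual analysis: the column names `started_at`, `ended_at`, and `member_casual`, the timestamp format, and every row are invented for the example.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Invented trip rows; the column names are assumptions for this sketch.
SAMPLE = """ride_id,started_at,ended_at,member_casual
a1,2023-06-03 10:00:00,2023-06-03 10:45:00,casual
a2,2023-06-05 08:00:00,2023-06-05 08:12:00,member
a3,2023-06-04 14:30:00,2023-06-04 15:00:00,casual
a4,2023-06-06 17:05:00,2023-06-06 17:15:00,member
"""

FMT = "%Y-%m-%d %H:%M:%S"


def mean_ride_minutes(csv_text):
    """Average ride duration in minutes for each rider type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = datetime.strptime(row["started_at"], FMT)
        end = datetime.strptime(row["ended_at"], FMT)
        minutes = (end - start).total_seconds() / 60
        if minutes <= 0:  # drop rides with non-positive duration
            continue
        totals[row["member_casual"]] += minutes
        counts[row["member_casual"]] += 1
    return {kind: totals[kind] / counts[kind] for kind in totals}


averages = mean_ride_minutes(SAMPLE)
print(averages)
```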

    Project Overview:

    This capstone project is a culmination of the skills and knowledge acquired through the Google Professional Data Analytics Certification. It focuses on Track 1, which is centered around Cyclistic, a fictional bike-share company modeled to reflect real-world data analytics scenarios in the transportation and service industry.

    Dataset Acknowledgment:

    We are grateful to Motivate Inc. for providing the dataset that serves as the foundation of this capstone project. Their contribution has enabled us to apply practical data analytics techniques to a real-world dataset, mirroring the challenges and opportunities present in the bike-sharing sector.

    Objective:

    The primary goal of this project is to analyze the Cyclistic dataset to uncover actionable insights that could help the company optimize its operations, improve customer satisfaction, and increase its market share. Through comprehensive data exploration, cleaning, analysis, and visualization, we aim to identify patterns and trends that inform strategic business decisions.

    Methodology:

    Data Collection: Utilizing the dataset provided by Motivate Inc., which includes detailed information on bike usage, customer behavior, and operational metrics.

    Data Cleaning and Preparation: Ensuring the dataset is accurate, complete, and ready for analysis by addressing any inconsistencies, missing values, or anomalies.

    Data Analysis: Applying statistical methods and data analytics techniques to extract meaningful insights from the dataset.

    Visualization and Reporting: Creating intuitive and compelling visualizations to present the findings clearly and effectively, facilitating data-driven decision-making.

    Findings and Recommendations:

    Conclusion:

    The Cyclistic Capstone Project not only demonstrates the practical application of data analytics skills in a real-world scenario but also provides valuable insights that can drive strategic improvements for Cyclistic. The project showcases the power of data analytics in transforming data into actionable knowledge and underscores the importance of data-driven decision-making in today's competitive business landscape.

    Acknowledgments:

    Special thanks to Motivate Inc. for their support and for providing the dataset that made this project possible. Their contribution is immensely appreciated and has significantly enhanced the learning experience.

    STRATEGIES USED

    Case Study Roadmap - ASK

    ●What is the problem you are trying to solve? ●How can your insights drive business decisions?

    Key Tasks ● Identify the business task ● Consider key stakeholders

    Deliverable ● A clear statement of the business task

    Case Study Roadmap - PREPARE

    ● Where is your data located? ● Are there any problems with the data?

    Key tasks ● Download data and store it appropriately. ● Identify how it’s organized.

    Deliverable ● A description of all data sources used

    Case Study Roadmap - PROCESS

    ● What tools are you choosing and why? ● What steps have you taken to ensure that your data is clean?

    Key tasks ● Choose your tools. ● Document the cleaning process.

    Deliverable ● Documentation of any cleaning or manipulation of data

    Case Study Roadmap - ANALYZE

    ● Has your data been properly formatted? ● How will these insights help answer your business questions?

    Key tasks ● Perform calculations ● Formatting

    Deliverable ● A summary of analysis

    Case Study Roadmap - SHARE

    ● Were you able to answer all questions of stakeholders? ● Can Data visualization help you share findings?

    Key tasks ● Present your findings ● Create effective data viz.

    Deliverable ● Supporting viz and key findings

    **Case Study Roadmap - A...

  4. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes that can be found in the repository. We manually engineered these features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    2. Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.

    3. Transformation. In this stage, we did not use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships on the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    5. Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = number of occurrences in which the statement is true, divided by the total number of statements.
    Confidence = support of the statement divided by the support of the premise (equivalently, occurrences of the statement divided by occurrences of the premise).
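Both measures can be computed directly from a list of records. The sketch below uses the standard definition of confidence (support of premise and conclusion together, divided by support of the premise). The paper records and feature labels are invented for illustration, and this is not the authors' RapidMiner pipeline.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for r in records if items <= set(r)) / len(records)


def confidence(records, premise, conclusion):
    """Support of premise and conclusion together, relative to the premise."""
    joint = support(records, set(premise) | set(conclusion))
    return joint / support(records, premise)


# Invented records: one set of feature labels per surveyed paper.
papers = [
    {"supervised", "irreproducible"},
    {"supervised", "irreproducible"},
    {"supervised", "reproducible"},
    {"unsupervised", "reproducible"},
]

rule_support = support(papers, {"supervised", "irreproducible"})
rule_confidence = confidence(papers, {"supervised"}, {"irreproducible"})
print(rule_support, rule_confidence)
```

In this toy data the rule "supervised implies irreproducible" holds in 2 of 4 records (support 0.5) and in 2 of the 3 records where the premise holds (confidence about 0.67).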

  5. Data from: Understanding Data Analysis Steps in Mass-Spectrometry-Based...

    • figshare.com
    zip
    Updated Sep 3, 2025
    Cite
    Nadezhda T. Doncheva; Veit Schwämmle; Marie Locard-Paulet (2025). Understanding Data Analysis Steps in Mass-Spectrometry-Based Proteomics Is Key to Transparent Reporting [Dataset]. http://doi.org/10.1021/acs.jproteome.5c00287.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 3, 2025
    Dataset provided by
    ACS Publications
    Authors
    Nadezhda T. Doncheva; Veit Schwämmle; Marie Locard-Paulet
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Mass spectrometry (MS)-based proteomics data analysis is composed of many stages from quality control, data cleaning, and normalization to statistical and functional analysis, without forgetting multiple visualization steps. All of these need to be reported next to published results to make them fully understandable and reusable for the community. Although this seems straightforward, exhaustively reporting all aspects of an analysis workflow can be tedious and error prone. This letter reports good practices when describing data analysis of MS-based proteomics data and discusses why and how the community should put efforts into more transparently reporting data analysis workflows.

  6. Collection of example datasets used for the book - R Programming -...

    • figshare.com
    txt
    Updated Dec 4, 2023
    Cite
    Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kingsley Okoye; Samira Hosseini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers, and explains how to perform different types of statistical data analysis for research purposes using the R programming language. R is open-source software and an object-oriented programming language, with a development environment (IDE) called RStudio, for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system. For all intents and purposes, this book serves as both textbook and manual for R statistics, particularly in academic research, data analytics, and computer programming, and is targeted to help inform and guide the work of R users and statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for the use of each in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test the reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained in this book.

    It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples. It ranges from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, it offers a congruence of statistics and computer programming for research.

  7. Dataset of books called Data analysis in business research : a step-by-step...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Data analysis in business research : a step-by-step nonparametric approach [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Data+analysis+in+business+research+%3A+a+step-by-step+nonparametric+approach
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Data analysis in business research : a step-by-step nonparametric approach. It features 7 columns including author, publication date, language, and book publisher.

  8. Netflix Data Analysis

    • kaggle.com
    Updated Oct 15, 2024
    Cite
    Ankul Sharma (2024). Netflix Data Analysis [Dataset]. https://www.kaggle.com/datasets/ankulsharma150/netflix-data-analysis
    Explore at:
    Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    Oct 15, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ankul Sharma
    Description

    Introduction

    This dataset is about Netflix Movies & TV Shows. It has 12 columns, some with null values. The analysis uses the Pandas, plotly.express, and Datetime libraries. I divided the analysis into several parts for a stepwise approach and to answer questions trending on social media about Bollywood actors and actresses.

    Data Manipulation

    Missing Data

    Missing data can be represented in several ways, such as null values and missing values. I used some standard data analysis methods to clean them.

    Data Munging

    String Method

    Here I used string methods on columns such as 'cast' and 'Lested_in' to extract data.

    Datetime data type

    Converting an object-type column with the to_datetime function yields datetime objects, from which parts of the date such as year, month, and day can be extracted.
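The dataset author did this conversion with pandas' to_datetime. As a dependency-free sketch of the same idea, the standard library's `datetime.strptime` can parse each text value and expose its parts. The date format `"%B %d, %Y"` and the sample values are assumptions for illustration, and month names are parsed according to the current locale (English is assumed here).

```python
from datetime import datetime

# Invented sample values, including one missing entry.
date_added = ["September 25, 2021", "January 1, 2020", ""]


def parse_date(text, fmt="%B %d, %Y"):
    """Parse one text value; return None when the value is missing."""
    text = text.strip()
    return datetime.strptime(text, fmt) if text else None


parsed = [parse_date(value) for value in date_added]

# Pull out year, month, and day, keeping None for missing dates.
parts = [(d.year, d.month, d.day) if d else None for d in parsed]
print(parts)
```

Handling the empty string explicitly mirrors the missing-data cleaning step: a blank date stays missing instead of raising an error.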

    EDA

    Here I explored several eye-catching questions:
    - Show all Movies & TV Shows released by month
    - Count all unique rating types and find which rating is most common
    - All movies of Salman, Shah Rukh, and Akshay Kumar
    - Find the Movies & Series with the maximum running time
    - Year-on-year shows added to Netflix, by type
    - All Akshay Kumar comedy movies, Shah Rukh movies with Kajol, and Salman-Akshay movies
    - Which director has made the most TV Shows
    - Actors and actresses who have appeared in the most movies
    - Which genres have the most Movies and TV Shows

  9. cases study1 example for google data analytics

    • kaggle.com
    zip
    Updated Apr 22, 2023
    Cite
    mohammed hatem (2023). cases study1 example for google data analytics [Dataset]. https://www.kaggle.com/datasets/mohammedhatem/cases-study1-example-for-google-data-analytics
    Explore at:
    Available download formats: zip (25278847 bytes)
    Dataset updated
    Apr 22, 2023
    Authors
    mohammed hatem
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    On my journey to earn the Google Data Analytics certificate, I will practice on a real-world example by following the steps of the data analysis process: ask, prepare, process, analyze, share, and act. I picked the Bellabeat example.

  10. Data from: MCnebula: Critical Chemical Classes for the Classification and...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    xlsx
    Updated Jun 14, 2023
    Cite
    Lichuang Huang; Qiyuan Shan; Qiang Lyu; Shuosheng Zhang; Lu Wang; Gang Cao (2023). MCnebula: Critical Chemical Classes for the Classification and Boost Identification by Visualization for Untargeted LC–MS/MS Data Analysis [Dataset]. http://doi.org/10.1021/acs.analchem.3c01072.s005
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    ACS Publications
    Authors
    Lichuang Huang; Qiyuan Shan; Qiang Lyu; Shuosheng Zhang; Lu Wang; Gang Cao
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Untargeted mass spectrometry is a robust tool for biology, but it usually requires a large amount of time on data analysis, especially for system biology. A framework called Multiple-Chemical nebula (MCnebula) was developed herein to facilitate the LC–MS data analysis process by focusing on critical chemical classes and visualization in multiple dimensions. This framework consists of three vital steps as follows: (1) abundance-based classes (ABC) selection algorithm, (2) critical chemical classes to classify “features” (corresponding to compounds), and (3) visualization as multiple Child-Nebulae (network graph) with annotation, chemical classification, and structure. Notably, MCnebula can be used to explore the classification and structural characteristic of unknown compounds beyond the limit of the spectral library. Moreover, it is intuitive and convenient for pathway analysis and biomarker discovery because of its function of ABC selection and visualization. MCnebula was implemented in the R language. A series of tools in R packages were provided to facilitate downstream analysis in an MCnebula-featured way, including feature selection, homology tracing of top features, pathway enrichment analysis, heat map clustering analysis, spectral visualization analysis, chemical information query, and output analysis reports. The broad utility of MCnebula was illustrated by a human-derived serum data set for metabolomics analysis. The results indicated that “Acyl carnitines” were screened out by tracing structural classes of biomarkers, which was consistent with the reference. A plant-derived data set was investigated to achieve a rapid annotation and discovery of compounds in E. ulmoides.

  11. A Two-Step Method for smFRET Data Analysis

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Jul 14, 2016
    Cite
    Piecco, Kurt Waldo Sy; Pyle, Joseph R.; Chen, Jixin; Kolomeisky, Anatoly B.; Landes, Christy F. (2016). A Two-Step Method for smFRET Data Analysis [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001570037
    Explore at:
    Dataset updated
    Jul 14, 2016
    Authors
    Piecco, Kurt Waldo Sy; Pyle, Joseph R.; Chen, Jixin; Kolomeisky, Anatoly B.; Landes, Christy F.
    Description

    We demonstrate a two-step data analysis method to increase the accuracy of single-molecule Förster Resonance Energy Transfer (smFRET) experiments. Most current smFRET studies are at a time resolution on the millisecond level. When the system also contains molecular dynamics on the millisecond level, simulations show that large errors are present (e.g., > 40%) because false state assignment becomes significant during data analysis. We introduce and confirm an additional step after normal smFRET data analysis that is able to reduce the error (e.g., < 10%). The idea is to use Monte Carlo simulation to search ideal smFRET trajectories and compare them to the experimental data. Using a mathematical model, we are able to find the matches between these two sets, and back guess the hidden rate constants in the experimental results.

  12. Financial Data Analysis Process

    • figshare.com
    xml
    Updated Jun 11, 2023
    CONG LIU (2023). Financial Data Analysis Process [Dataset]. http://doi.org/10.6084/m9.figshare.23488436.v2
    Explore at:
    Available download formats: xml
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    figshare
    Authors
    CONG LIU
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    · Financial expenses1 dataset: This dataset consists of simulated event logs generated from the financial expense data analysis process model. Each trace provides a detailed description of the process of analyzing office expense data.
    · Financial expenses2 dataset: This dataset consists of simulated event logs generated from the travel expense data analysis process model. Each trace provides a detailed description of the process of analyzing travel expense data.
    · Financial expenses3 dataset: This dataset consists of simulated event logs generated from the sales expense data analysis process model. Each trace provides a detailed description of the process of analyzing sales expense data.
    · Financial expenses4 dataset: This dataset consists of simulated event logs generated from the management expense data analysis process model. Each trace provides a detailed description of the process of analyzing management expense data.
    · Financial expenses5 dataset: This dataset consists of simulated event logs generated from the manufacturing expense data analysis process model. Each trace provides a detailed description of the process of analyzing manufacturing expense data.
    · Financial expenses6 dataset: This dataset consists of simulated event logs generated from the financial statement data analysis process model. Each trace provides a detailed description of the process of analyzing financial statement data.

  13. Data from: USAGE OF DISSIMILARITY MEASURES AND MULTIDIMENSIONAL SCALING FOR...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 11, 2025
    + more versions
    Dashlink (2025). USAGE OF DISSIMILARITY MEASURES AND MULTIDIMENSIONAL SCALING FOR LARGE SCALE SOLAR DATA ANALYSIS [Dataset]. https://catalog.data.gov/dataset/usage-of-dissimilarity-measures-and-multidimensional-scaling-for-large-scale-solar-data-an
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    USAGE OF DISSIMILARITY MEASURES AND MULTIDIMENSIONAL SCALING FOR LARGE SCALE SOLAR DATA ANALYSIS. Juan M Banda, Rafal Angryk. ABSTRACT: This work describes the application of several dissimilarity measures combined with multidimensional scaling for large scale solar data analysis. Using the first solar domain-specific benchmark data set that contains multiple types of phenomena, we investigated combinations of different image parameters with different dissimilarity measures in order to determine which combination will allow us to differentiate our solar data within each class and versus the rest of the classes. In this work we also address the issue of reducing dimensionality by applying multidimensional scaling to the dissimilarity matrices produced by the previously mentioned combinations. By applying multidimensional scaling we can investigate how many resulting components are needed in order to maintain a good representation of our data (in an artificial dimensional space) and how many can be discarded in order to economize our storage costs. We present a comparative analysis between different classifiers in order to determine the amount of dimensionality reduction that can be achieved with said combination of image parameters, similarity measure and multidimensional scaling.

  14. Google Data Analytics Capstone Project

    • kaggle.com
    zip
    Updated Nov 13, 2021
    + more versions
    NANCY CHAUHAN (2021). Google Data Analytics Capstone Project [Dataset]. https://www.kaggle.com/datasets/nancychauhan199/google-case-study-pdf
    Explore at:
    Available download formats: zip (284279 bytes)
    Dataset updated
    Nov 13, 2021
    Authors
    NANCY CHAUHAN
    Description

    Case Study: How Does a Bike-Share Navigate Speedy Success?

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path. By the end of this lesson, you will have a portfolio-ready case study. Download the packet and reference the details of this case study anytime. Then, when you begin your job hunt, your case study will be a tangible way to demonstrate your knowledge and skills to potential employers.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Characters and teams

    ● Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
    ● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
    ● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals, as well as how you, as a junior data analyst, can help Cyclistic achieve them.
    ● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

    About the company

    In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends

    Three questions will guide the future marketing program:

    1. How do annual members and casual riders use Cyclistic bikes differently?
    2. Why would casual riders buy Cyclistic annual memberships?
    3. How can Cyclistic use digital media to influence casual riders to become members?

    Moreno has assigned you the first question to answer: How do annual members and casual rid...

  15. Steps of our qualitative data analysis.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Apr 20, 2023
    Butler, Jorie; Wang, Ching-Yu; Wallace, Andrea S.; Sharareh, Nasser (2023). Steps of our qualitative data analysis. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000949773
    Explore at:
    Dataset updated
    Apr 20, 2023
    Authors
    Butler, Jorie; Wang, Ching-Yu; Wallace, Andrea S.; Sharareh, Nasser
    Description

    Background: Food insecurity is a social determinant of health that impacts more than 10% of U.S. households every year. Many unexpected events make food-insecure people and those with unmet food needs seek information and help from both formal (e.g., community organizations) and informal (e.g., family/friends) resources. Food-related information seeking through telephone calls to a community referral system (the 211 network) has been used as a proxy for food insecurity, but the context of these calls has not been characterized and the validity of this proxy measure is unknown.

    Objective: To investigate the content of food-related telephone calls to 211 and explore the indications of food insecurity during these calls.

    Methods: We conducted a secondary qualitative analysis on the transcripts of food-related calls to Utah’s 211. From February to March 2022, 25 calls were sampled based on the location of callers to ensure the representation of rural residents. 13 calls from metropolitan and 12 calls from nonmetropolitan ZIP Codes were included. Using a purposive sampling approach, we also made sure that the sample varied with regard to race and ethnicity. Calls were transcribed and de-identified by our community partner, Utah’s 211, and were analyzed using a thematic analysis approach by our research team.

    Results: Three themes emerged from the qualitative analysis including referral to 211, reasons for food-related calls, and reasons for unmet food needs. Results highlight the complex social environment around 211 food-related callers, lack of knowledge about available food resources, and indications of food insecurity in calls.

    Conclusion: Information seeking for food-related resources through 211 is a problem-solving source for people living in a complex social environment. Indications of food insecurity through these calls validate the use of these calls as a proxy measure for food insecurity. Interventions should be designed to increase awareness about the available resources and address the co-existing social needs with food insecurity.

  16. Data from: HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE...

    • datasetcatalog.nlm.nih.gov
    • scielo.figshare.com
    Updated May 27, 2022
    Helito, Camilo Partezani; Gonçalves, Romeu Krause; de Lima, Lana Lacerda; Clazzer, Renata; de Lima, Diego Ariel; de Camargo, Olavo Pires (2022). HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE USING R SOFTWARE AND RSTUDIO [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000403452
    Explore at:
    Dataset updated
    May 27, 2022
    Authors
    Helito, Camilo Partezani; Gonçalves, Romeu Krause; de Lima, Lana Lacerda; Clazzer, Renata; de Lima, Diego Ariel; de Camargo, Olavo Pires
    Description

    ABSTRACT Meta-analysis is an adequate statistical technique to combine results from different studies, and its use has been growing in the medical field. Thus, not only knowing how to interpret meta-analysis, but also knowing how to perform one, is fundamental today. Therefore, the objective of this article is to present the basic concepts and serve as a guide for conducting a meta-analysis using R and RStudio software. For this, the reader has access to the basic commands in the R and RStudio software, necessary for conducting a meta-analysis. The advantage of R is that it is a free software. For a better understanding of the commands, two examples were presented in a practical way, in addition to revising some basic concepts of this statistical technique. It is assumed that the data necessary for the meta-analysis has already been collected, that is, the description of methodologies for systematic review is not a discussed subject. Finally, it is worth remembering that there are many other techniques used in meta-analyses that were not addressed in this work. However, with the two examples used, the article already enables the reader to proceed with good and robust meta-analyses. Level of Evidence V, Expert Opinion.

  17. Data from: MGVB: a new proteomics toolset for fast and efficient data...

    • ebi.ac.uk
    Updated Nov 15, 2024
    + more versions
    Metodi Metodiev (2024). MGVB: a new proteomics toolset for fast and efficient data analysis [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD051331
    Explore at:
    Dataset updated
    Nov 15, 2024
    Authors
    Metodi Metodiev
    Variables measured
    Proteomics
    Description

    MGVB is a collection of tools for proteomics data analysis. It covers data processing from in silico digestion of protein sequences to comprehensive identification of post-translational modifications and solving the protein inference problem. The toolset is developed with efficiency in mind. It enables analysis at a fraction of the resource cost typically required by existing commercial and free tools. MGVB, as it is a native application, is much faster than existing proteomics tools such as MaxQuant and MSFragger and, at the same time, finds a very similar, in some cases even larger, number of peptides at a chosen level of statistical significance. It implements a probabilistic scoring function to match spectra to sequences, a novel combinatorial search strategy for finding post-translational modifications, and a Bayesian approach to locate modification sites. This report describes the algorithms behind the tools, presents analyses of benchmarking data sets comparing MGVB performance to MaxQuant/Andromeda, and provides step by step instructions for using it in typical analytical scenarios. The toolset is provided free to download and use for academic research and in software projects, but is not open source at present. It is the intention of the author that it will be made open source in the near future, following rigorous evaluations and feedback from the proteomics research community.

  18. Data from: Attention Allocation to Projection Level Alleviates...

    • data.mendeley.com
    Updated May 28, 2024
    Yang Cai (2024). Attention Allocation to Projection Level Alleviates Overconfidence in Situation Awareness [Dataset]. http://doi.org/10.17632/jb5j2rczjz.1
    Explore at:
    Dataset updated
    May 28, 2024
    Authors
    Yang Cai
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains several files related to our research paper titled "Attention Allocation to Projection Level Alleviates Overconfidence in Situation Awareness". These files are intended to provide a comprehensive overview of the data analysis process and the presentation of results. Below is a list of the files included and a brief description of each:

    R Scripts: These are scripts written in the R programming language for data processing and analysis. The scripts detail the steps for data cleaning, transformation, statistical analysis, and the visualization of results. To replicate the study findings or to conduct further analyses on the dataset, users should run these scripts.

    R Markdown File: Offers a dynamic document that combines R code with rich text elements such as paragraphs, headings, and lists. This file is designed to explain the logic and steps of the analysis in detail, embedding R code chunks and the outcomes of code execution. It serves as a comprehensive guide to understanding the analytical process behind the study.

    HTML File: Generated from the R Markdown file, this file provides an interactive report of the results that can be viewed in any standard web browser. For those interested in browsing the study's findings without delving into the specifics of the analysis, this HTML file is the most convenient option. It presents the final analysis outcomes in an intuitive and easily understandable manner. For optimal viewing, we recommend opening the HTML file with the latest version of Google Chrome or any other modern web browser. This approach ensures that all interactive functionalities are fully operational.

    Together, these files form a complete framework for the research analysis, aimed at enhancing the transparency and reproducibility of the study.

  19. Data release for solar-sensor angle analysis subset associated with the...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 27, 2025
    + more versions
    U.S. Geological Survey (2025). Data release for solar-sensor angle analysis subset associated with the journal article "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States" [Dataset]. https://catalog.data.gov/dataset/data-release-for-solar-sensor-angle-analysis-subset-associated-with-the-journal-article-so
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Western United States, United States
    Description

    This dataset provides geospatial location data and scripts used to analyze the relationship between MODIS-derived NDVI and solar and sensor angles in a pinyon-juniper ecosystem in Grand Canyon National Park. The data are provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and scripts allow users to replicate, test, or further explore results. The file GrcaScpnModisCellCenters.csv contains locations (latitude-longitude) of all the 250-m MODIS (MOD09GQ) cell centers associated with the Grand Canyon pinyon-juniper ecosystem that the Southern Colorado Plateau Network (SCPN) is monitoring through its land surface phenology and integrated upland monitoring programs. The file SolarSensorAngles.csv contains MODIS angle measurements for the pixel at the phenocam location plus a random 100 point subset of pixels within the GRCA-PJ ecosystem. The script files (folder: 'Code') consist of 1) a Google Earth Engine (GEE) script used to download MODIS data through the GEE javascript interface, and 2) a script used to calculate derived variables and to test relationships between solar and sensor angles and NDVI using the statistical software package 'R'. The file Fig_8_NdviSolarSensor.JPG shows NDVI dependence on solar and sensor geometry demonstrated for both a single pixel/year and for multiple pixels over time. (Left) MODIS NDVI versus solar-to-sensor angle for the Grand Canyon phenocam location in 2018, the year for which there is corresponding phenocam data. (Right) Modeled r-squared values by year for 100 randomly selected MODIS pixels in the SCPN-monitored Grand Canyon pinyon-juniper ecosystem. The model for forward-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle. The model for back-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle + sensor zenith angle. 
    Boxplots show interquartile ranges; whiskers extend to 10th and 90th percentiles. The horizontal line marking the average median value for forward-scatter r-squared (0.835) is nearly indistinguishable from the back-scatter line (0.833). The dataset folder also includes supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g., the folders .Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). The empty folder GEE_DataAngles is included so that the user can save the data files from the Google Earth Engine scripts to this location, where they can then be incorporated into the R processing scripts without needing to change folder names. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may also use the descriptive documentation, the phenopix package documentation, and the description/references provided in the associated journal article to process the data to achieve the same results using newer packages or other software programs.
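
The forward-scatter model named in this description, log(NDVI) ~ solar-to-sensor angle, can be illustrated with a minimal sketch. This is not the authors' R script: the function name `fit_log_linear` and the angle and NDVI values are hypothetical, the data are synthetic, and only the single-predictor model is covered.

```python
import math

def fit_log_linear(angles, ndvi):
    """Ordinary least squares fit of log(ndvi) = a + b * angle.

    Returns (intercept, slope, r_squared). Hypothetical helper for illustration.
    """
    ys = [math.log(v) for v in ndvi]
    n = len(angles)
    mean_x = sum(angles) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in angles)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(angles, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r-squared: share of the variance in log(NDVI) explained by the angle
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(angles, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1.0 - ss_res / ss_tot

# Synthetic values (assumed, not taken from SolarSensorAngles.csv)
angles = [20.0, 40.0, 60.0, 80.0, 100.0]
ndvi = [0.30 * math.exp(-0.004 * a) for a in angles]

intercept, slope, r_squared = fit_log_linear(angles, ndvi)
print(intercept, slope, r_squared)
```

Fitting this per pixel and per year would give one r-squared value each time, which is the kind of quantity the boxplots summarize. The back-scatter model adds sensor zenith angle as a second predictor and would need a multiple regression instead.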

  20. Data from: Data files used to study change dynamics in software systems

    • figshare.swinburne.edu.au
    pdf
    Updated Jul 22, 2024
    Rajesh Vasa (2024). Data files used to study change dynamics in software systems [Dataset]. http://doi.org/10.25916/sut.26288227.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Swinburne
    Authors
    Rajesh Vasa
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    It is a widely accepted fact that evolving software systems change and grow. However, it is less well-understood how change is distributed over time, specifically in object oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place as well as to inform them of the longer term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, as well as to provide useful information as input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. But in order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes. Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes and classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance in order for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution prone parts of a system as well as support effort estimation activities. The specific research questions that we address in this chapter are: (1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general? 
(2) How is modification frequency distributed for classes that change? (3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications? (4) Does structural complexity make a class susceptible to change? (5) Does popularity make a class more change-prone? We make recommendations that can help developers to proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2MB in total) is provided as a comma separated values (CSV) file, and the first line of the CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).

Jason Porzelius (2023). Google Certificate BellaBeats Capstone Project [Dataset]. https://www.kaggle.com/datasets/jasonporzelius/google-certificate-bellabeats-capstone-project

Google Certificate BellaBeats Capstone Project

Explore at:
Available download formats: zip (169161 bytes)
Dataset updated
Jan 5, 2023
Authors
Jason Porzelius
Description

Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally hosted database program, Excel for both my data analysis and visualizations. This choice was made primarily because I live in a remote area and have limited bandwidth and inconsistent internet access. Therefore, completing a capstone project using web-based programs such as R Studio, SQL Workbench, or Google Sheets was not a feasible choice. I was further limited in which option to choose as the datasets for the ride-share project option were larger than my version of Excel would accept. In the scenario provided, I will be acting as a Junior Data Analyst in support of the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in hopes that the findings from that dataset might reveal insights which will assist in Bellabeat's marketing strategies for future growth. My task is to provide data driven insights to business tasks provided by the Bellabeats, Inc.'s executive and data analysis team. In order to accomplish this task, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). In addition, I will break each part of the Data Analysis Process down into three sections to provide clarity and accountability. Those three sections are: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I will record the deliverables for each Key Task directly under the numbered Key Task using an asterisk (*) as an identifier.

Section 1 - Ask:

A. Guiding Questions:

  1. Who are the key stakeholders and what are their goals for the data analysis project?
  2. What is the business task that this data analysis project is attempting to solve?

B. Key Tasks:

  1. Identify key stakeholders and their goals for the data analysis project.
    *The key stakeholders for this project are as follows:
    -Urška Sršen and Sando Mur - co-founders of Bellabeats, Inc.
    -Bellabeats marketing analytics team. I am a member of this team.

  2. Identify the business task.
    *The business task is: As provided by co-founder Urška Sršen, the business task for this project is to gain insight into how consumers are using their non-BellaBeats smart devices in order to guide upcoming marketing strategies for the company which will help drive future growth. Specifically, the researcher was tasked with applying insights driven by the data analysis process to 1 BellaBeats product and presenting those insights to BellaBeats stakeholders.

Section 2 - Prepare:

A. Guiding Questions:

  1. Where is the data stored and organized?
  2. Are there any problems with the data?
  3. How does the data help answer the business question?

B. Key Tasks:

  1. Research and communicate the source of the data, and how it is stored/organized to stakeholders.
    *The data source used for our case study is FitBit Fitness Tracker Data. This dataset is stored on Kaggle and was made available by user Mobius in an open-source format. Therefore, the data is public and available to be copied, modified, and distributed, all without asking the user for permission. These datasets were reportedly (see credibility section directly below) generated by respondents to a survey distributed via Amazon Mechanical Turk between 03/12/2016 and 05/12/2016.
    *Reportedly (see credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute, hour, and day level totals. This data is stored in 18 CSV documents. I downloaded all 18 documents onto my laptop and decided to use 2 of them for the purposes of this project, as they were the files which had merged activity and sleep data from the other documents. All unused documents were permanently deleted from the laptop. The 2 files used were:
    -sleepDay_merged.csv
    -dailyActivity_merged.csv

  2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias.
    *As will be more specifically presented in the Process section, the data seems to have credibility issues related to the reported time frame of the data collected. The metadata seems to indicate that the data collected covered roughly 2 months of FitBit tracking. However, upon my initial data processing, I found that only 1 month of data was reported.
    *As will be more specifically presented in the Process section, the data has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata communicates that 30 individual users agreed to report their tracking data. My initial data processing uncovered 33 individual ...
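
The two credibility checks described above (how many distinct users reported data, and what date range the records actually cover) were done in Excel by the author. For readers who prefer a script, the same checks might look like the following sketch. The column names `Id` and `ActivityDate`, the date format, and every value in the inline sample are assumptions for illustration, not data from the Kaggle files.

```python
import csv
import io
from datetime import datetime

# Hypothetical stand-in for a file such as dailyActivity_merged.csv.
# Column names and values are assumed, not copied from the real dataset.
SAMPLE_CSV = """Id,ActivityDate,TotalSteps
1001,4/12/2016,13162
1001,4/13/2016,10735
1002,4/12/2016,8163
1003,5/12/2016,4747
"""

def summarize(csv_text):
    """Return (number of distinct users, first date, last date)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ids = {row["Id"] for row in rows}
    dates = [
        datetime.strptime(row["ActivityDate"], "%m/%d/%Y").date()
        for row in rows
    ]
    return len(ids), min(dates), max(dates)

user_count, first_day, last_day = summarize(SAMPLE_CSV)
print(user_count, first_day, last_day)
```

Comparing `user_count` and the span between `first_day` and `last_day` against the figures stated in the metadata is one way to surface the kind of mismatch reported here.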
