64 datasets found
  1. Coffee Shop Sales Analysis

    • kaggle.com
    Updated Apr 25, 2024
    Cite
    Monis Amir (2024). Coffee Shop Sales Analysis [Dataset]. https://www.kaggle.com/datasets/monisamir/coffee-shop-sales-analysis
    Explore at:
    Croissant (Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 25, 2024
    Dataset provided by
    Kaggle
    Authors
    Monis Amir
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Analyzing Coffee Shop Sales: Excel Insights 📈

    In my first data analytics project, I explore the secrets of a fictional coffee shop's success through data-driven analysis. By analyzing a 5-sheet Excel dataset, I've uncovered valuable sales trends, customer preferences, and insights that can guide future business decisions. 📊☕

    DATA CLEANING 🧹

    • REMOVED DUPLICATES OR IRRELEVANT ENTRIES: Thoroughly eliminated duplicate records and irrelevant data to refine the dataset for analysis.

    • FIXED STRUCTURAL ERRORS: Rectified any inconsistencies or structural issues within the data to ensure uniformity and accuracy.

    • CHECKED FOR DATA CONSISTENCY: Verified the integrity and coherence of the dataset by identifying and resolving any inconsistencies or discrepancies.

    DATA MANIPULATION 🛠️

    • UTILIZED LOOKUPS: Used Excel's lookup functions for efficient data retrieval and analysis.

    • IMPLEMENTED INDEX MATCH: Leveraged the Index Match function to perform advanced data searches and matches.

    • APPLIED SUMIFS FUNCTIONS: Utilized SUMIFS to calculate totals based on specified criteria.

    • CALCULATED PROFITS: Used relevant formulas and techniques to determine profit margins and insights from the data.

    PIVOTING THE DATA 𝄜

    • CREATED PIVOT TABLES: Utilized Excel's PivotTable feature to pivot the data for in-depth analysis.

    • FILTERED DATA: Utilized pivot tables to filter and analyze specific subsets of data, enabling focused insights. Especially used in the “PEAK HOURS” and “TOP 3 PRODUCTS” charts (a rough pandas analogue of this pivoting is sketched below).
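
    For readers who prefer a scripted workflow, a rough pandas analogue of the SUMIFS-style criteria totals and PivotTable summaries described above might look like the following. The file and column names here are illustrative assumptions, not part of the original Excel workbook:

    ```python
    import pandas as pd

    # Illustrative file and column names; the actual Excel workbook likely differs.
    sales = pd.read_excel("coffee_shop_sales.xlsx", sheet_name="Transactions")

    # SUMIFS-style total: revenue for one product category within a date range.
    mask = (sales["product_category"] == "Coffee") & (sales["transaction_date"] >= "2023-01-01")
    coffee_revenue = sales.loc[mask, "transaction_amount"].sum()

    # PivotTable-style summary: total sales by hour of day (peak-hours chart).
    sales["hour"] = pd.to_datetime(sales["transaction_time"], format="%H:%M:%S").dt.hour
    peak_hours = sales.pivot_table(index="hour", values="transaction_amount", aggfunc="sum")

    # Top 3 products by revenue (clustered bar chart).
    top3 = (sales.groupby("product_type")["transaction_amount"]
                 .sum()
                 .sort_values(ascending=False)
                 .head(3))
    print(coffee_revenue, peak_hours, top3, sep="\n")
    ```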

    VISUALIZATION 📊

    • KEY INSIGHTS: Unveiled the grand total sales revenue while also analyzing the average bill per person, offering comprehensive insights into the coffee shop's performance and customer spending habits.

    • SALES TREND ANALYSIS: Used a line chart to show total sales across various time intervals, revealing valuable insights into evolving sales trends.

    • PEAK HOUR ANALYSIS: Leveraged Clustered Column chart to identify peak sales hours, shedding light on optimal operating times and potential staffing needs.

    • TOP 3 PRODUCTS IDENTIFICATION: Utilized Clustered Bar chart to determine the top three coffee types, facilitating strategic decisions regarding inventory management and marketing focus.

    *I also used a Timeline to visualize chronological data trends and identify key patterns over specific times.

    While it's a significant milestone for me, I recognize that there's always room for growth and improvement. Your feedback and insights are invaluable to me as I continue to refine my skills and tackle future projects. I'm eager to hear your thoughts and suggestions on how I can make my next endeavor even more impactful and insightful.

    THANKS TO: WsCube Tech, Mo Chen, Alex Freberg

    TOOLS USED: Microsoft Excel

    #DataAnalytics #DataAnalyst #ExcelProject #DataVisualization #BusinessIntelligence #SalesAnalysis #DataAnalysis #DataDrivenDecisions

  2. R scripts

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated May 10, 2018
    Cite
    Xueying Han (2018). R scripts [Dataset]. http://doi.org/10.6084/m9.figshare.5513170.v3
    Explore at:
    txt (available download formats)
    Dataset updated
    May 10, 2018
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Xueying Han
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R scripts in this fileset are those used in the PLOS ONE publication "A snapshot of translational research funded by the National Institutes of Health (NIH): A case study using behavioral and social science research awards and Clinical and Translational Science Awards funded publications." The article can be accessed here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0196545

    This consists of all R scripts used for data cleaning, data manipulation, and statistical analysis in the publication. There are eleven files in total:

    1. "Step1a.bBSSR.format.grants.and.publications.data.R" combines all bBSSR 2008-2014 grant award data and associated publications downloaded from NIH Reporter.
    2. "Step1b.BSSR.format.grants.and.publications.data.R" combines all BSSR-only 2008-2014 grant award data and associated publications downloaded from NIH Reporter.
    3. "Step2a.bBSSR.get.pubdates.transl.and.all.grants.R" queries PubMed and downloads associated bBSSR publication data.
    4. "Step2b.BSSR.get.pubdates.transl.and.all.grants.R" queries PubMed and downloads associated BSSR-only publication data.
    5. "Step3.summary.stats.R" performs summary statistics.
    6. "Step4.time.to.first.publication.R" performs the time-to-first-publication analysis.
    7. "Step5.time.to.citation.analysis.R" performs the time-to-first-citation and time-to-overall-citation analyses.
    8. "Step6.combine.NIH.iCite.data.R" combines NIH iCite citation data.
    9. "Step7.iCite.data.analysis.R" performs citation analysis on the combined iCite data.
    10. "Step8.MeSH.descriptors.R" queries PubMed and pulls down all MeSH descriptors for all publications.
    11. "Step9.CTSA.publications.R" compares the percent of translational publications among bBSSR, BSSR-only, and CTSA publications.

  3. Cross-Calibrated Multi-Platform Ocean Surface Wind Vector L3.5A Monthly...

    • podaac.jpl.nasa.gov
    • data.globalchange.gov
    • +1more
    html
    Updated Apr 1, 2024
    Cite
    PO.DAAC (2024). Cross-Calibrated Multi-Platform Ocean Surface Wind Vector L3.5A Monthly First-Look Analyses [Dataset]. http://doi.org/10.5067/CCF35-01AM1
    Explore at:
    html (available download formats)
    Dataset updated
    Apr 1, 2024
    Dataset provided by
    PO.DAAC
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    SURFACE WINDS
    Description

    This dataset is derived under the Cross-Calibrated Multi-Platform (CCMP) project and contains a value-added monthly mean ocean surface wind and pseudostress to approximate a satellite-only climatological data record. The CCMP datasets combine cross-calibrated satellite winds obtained from Remote Sensing Systems (REMSS) using a Variational Analysis Method (VAM) to produce a high-resolution (0.25 degree) gridded analysis. The CCMP data set includes cross-calibrated satellite winds derived from SSM/I, SSMIS, AMSR-E, TRMM TMI, QuikSCAT, SeaWinds, WindSat and other satellite instruments as they become available from REMSS. REMSS uses a cross-calibrated sea-surface emissivity model function which improves the consistency between wind speed retrievals from microwave radiometers (i.e., SSM/I, SSMIS, AMSR, TMI, WindSat) and those from scatterometers (i.e., QuikSCAT and SeaWinds).

    The VAM combines these data with in situ measurements and a starting estimate (first guess) of the wind field. The European Center for Medium-Range Weather Forecasts (ECMWF) ERA-40 Reanalysis is used as the first guess from 1987 to 1998. The ECMWF Operational analysis is used from January 1999 onward. All wind observations and analysis fields are referenced to a height of 10 meters. The ERA-40 can be obtained from the Computation and Information Systems Laboratory (CISL) at the National Center for Atmospheric Research (NCAR): http://rda.ucar.edu/datasets/ds117.0/. The ECMWF Operational analysis can also be obtained from CISL at NCAR: http://rda.ucar.edu/datasets/ds111.1/.

    Three products are distributed to complete the CCMP dataset series. The L3.0 product contains high-resolution analyses every 6 hours. These data are then time averaged over monthly and 5-day periods to derive the L3.5 product. Directions from the L3.0 product are then assigned to the time and location of the passive microwave satellite wind speed observations to derive the L2.5 product. All datasets are distributed on a 0.25 degree cylindrical coordinate grid.

    This dataset is one in a series of First-Look (FLK) CCMP datasets and is a continuation and expansion of the SSM/I surface wind velocity project that began under the NASA Pathfinder Program. Refinements and upgrades to the FLK version will be incorporated under a new release (date to be determined) known as Late-Look (LLK) and may include additional satellite datasets. All satellite surface wind data are obtained from REMSS under the DISCOVER project: Distributed Information Services: Climate/Ocean Products and Visualizations for Earth Research (http://www.discover-earth.org/index.html). The CCMP project is the result of an investigation funded by the NASA Making Earth Science data records for Use in Research Environments (MEaSUREs) program (http://community.eosdis.nasa.gov/measures/). In accordance with the MEaSUREs program, the CCMP datasets are also known as Earth System Data Records (ESDRs). In collaboration with private and government institutions, a team led by Dr. Robert Atlas (PI; proposal originally solicited by REASoN, and currently funded by MEaSUREs) has created the CCMP project to provide multi-instrument ocean surface wind velocity ESDRs, with wide-ranging research applications in meteorology and oceanography.

  4. First Trust Dorsey Wright Focus 5 ETF Alternative Data Analytics

    • meyka.com
    Updated Oct 8, 2025
    + more versions
    Cite
    Meyka (2025). First Trust Dorsey Wright Focus 5 ETF Alternative Data Analytics [Dataset]. https://meyka.com/stock/FV/alt-data/
    Explore at:
    Dataset updated
    Oct 8, 2025
    Dataset provided by
    Meyka
    Description

    Non-traditional data signals from social media and employment platforms for FV stock analysis

  5. Data from: Macaques preferentially attend to intermediately surprising...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Apr 26, 2022
    Cite
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd (2022). Macaques preferentially attend to intermediately surprising information [Dataset]. http://doi.org/10.6078/D15Q7Q
    Explore at:
    zip (available download formats)
    Dataset updated
    Apr 26, 2022
    Dataset provided by
    Klaviyo
    University of Minnesota
    University of California, Berkeley
    Yale University
    Authors
    Shengyi Wu; Tommy Blanchard; Emily Meschke; Richard Aslin; Ben Hayden; Celeste Kidd
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning.

    Methods

    How the data were collected: In this project, we collected gaze data from 5 macaques while they watched sequential visual displays designed to elicit probabilistic expectations. Gaze was recorded using the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system.

    Dataset:

    "csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:

    • subj: subject_ID = {"B":104, "C":102, "H":101, "J":103, "K":203}
    • trialtime: start time of current trial in seconds
    • trial: current trial number (each trial featured one of 80 possible visual-event sequences) (in order)
    • seq: current sequence number (one of 80 sequences)
    • seq_item: current item number in a seq (in order)
    • active_item: pop-up item (active box)
    • pre_active: prior pop-up item (active box) {-1: "the first active object in the sequence / no active object before the currently active object in the sequence"}
    • next_active: next pop-up item (active box) {-1: "the last active object in the sequence / no active object after the currently active object in the sequence"}
    • firstappear: {0: "not first", 1: "first appear in the seq"}
    • looks_blank: csv: total amount of time spent looking at blank space for the current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"}
    • looks_offscreen: csv: total amount of time spent looking offscreen for the current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"}
    • time till target: time spent to first start looking at the target object (ms) {-1: "never look at the target"}
    • looks target: csv: time spent looking at the target object (ms); csv_timestamp: looking at the target or not at the current timestamp (1 or 0)
    • look1,2,3: time spent looking at each object (ms)
    • location 123X, 123Y: location of each box (locations of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence)
    • item123id: pop-up item ID (remained static throughout a sequence)
    • event time: total time spent for the whole event (pop-up and go back) (ms)
    • eyeposX,Y: eye position at the current timestamp

    "csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:

    • rt: time till target {-1: "never look at the target"}. In data analysis, we included data that have rt > 0.
    • already_there: {NA: "never look at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and whose already_there is not NA.
    • looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"}
    • prob: the probability of the occurrence of the object
    • surprisal: unigram surprisal value
    • bisurprisal: transitional surprisal value
    • std_surprisal: standardized unigram surprisal value
    • std_bisurprisal: standardized transitional surprisal value
    • binned_surprisal_means: the means of unigram surprisal values binned into three groups of evenly spaced intervals according to surprisal values
    • binned_bisurprisal_means: the means of transitional surprisal values binned into three groups of evenly spaced intervals according to surprisal values

    "csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences

    Empty Values in Datasets:

    There are no missing values in the original dataset "csv-combined.csv". Missing values (marked as NA in the datasets) occur in the columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in the columns "prev_active" and "next_active" mean that the event is the first or last active object in the sequence, i.e., there is no active object before or after the currently active object. When we analyzed the variable "already_there", we eliminated data whose "prev_active" variable is NA. NAs in the column "already_there" mean that the subject never looked at the target object in the current event; when we analyzed "already_there", we also eliminated data whose "already_there" variable is NA. Missing values occur in the columns "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" when the event is the first event in its sequence and the transitional probability of the event cannot be computed because no event happened before it. When we fitted models for transitional statistics, we eliminated data whose "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" are NA.
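
    As a rough illustration of the inclusion criteria described above, a minimal pandas sketch (file and column names follow the descriptions in this listing; the actual analysis code lives in Monkilock_Data_Processing.ipynb and Analysis_Code_final.Rmd):

    ```python
    import pandas as pd

    # Load the processed dataset described above.
    df = pd.read_csv("csv-surprisal-prob.csv")

    # Keep only events where the subject eventually looked at the target (rt > 0).
    df = df[df["rt"] > 0]

    # Drop first events / repeats: "already_there" is NA when the subject never looked at the target.
    df = df[df["already_there"].notna()]

    # For models of transitional statistics, drop rows where the transitional surprisal
    # could not be computed (the first event in a sequence has no preceding event).
    transitional = df[df["bisurprisal"].notna() & df["std_bisurprisal"].notna()]
    print(len(df), len(transitional))
    ```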

    Codes:

    In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd

  6. GlobalHighPM2.5: Big Data Gapless 1 km Global Ground-level PM2.5 Dataset...

    • zenodo.org
    Updated Jul 10, 2024
    + more versions
    Cite
    Jing Wei; Zhanqing Li; Alexei Lyapustin; Jun Wang; Oleg Dubovik; Joel Schwartz; Lin Sun; Chi Li; Song Liu; Tong Zhu (2024). GlobalHighPM2.5: Big Data Gapless 1 km Global Ground-level PM2.5 Dataset over Land [Dataset]. http://doi.org/10.5281/zenodo.10081359
    Explore at:
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jing Wei; Zhanqing Li; Alexei Lyapustin; Jun Wang; Oleg Dubovik; Joel Schwartz; Lin Sun; Chi Li; Song Liu; Tong Zhu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 11, 2022
    Description

    GlobalHighPM2.5 is one of the series of long-term, full-coverage, global high-resolution and high-quality datasets of ground-level air pollutants over land (i.e., GlobalHighAirPollutants, GHAP). It is generated from big data (e.g., ground-based measurements, satellite remote sensing products, atmospheric reanalysis, and model simulations) using artificial intelligence by considering the spatiotemporal heterogeneity of air pollution.

    This dataset contains input data, analysis codes, and generated dataset used for the following article, and if you use the GlobalHighPM2.5 dataset for related scientific research, please cite the below-listed corresponding reference (Wei et al., NC, 2023):

    Input Data

    Relevant raw data for each figure (compiled into a single sheet within an Excel document) in the manuscript.

    Code

    Relevant Python scripts for replicating and plotting the analysis results in the manuscript, as well as code for converting data formats.

    Generated Dataset

    Here is the first big-data-derived, gapless (spatial coverage = 100%) monthly and yearly 1 km (i.e., M1K and Y1K) global ground-level PM2.5 dataset over land from 2017 to 2022. This dataset is of high quality, with cross-validation coefficient of determination (CV-R2) values of 0.91, 0.97, and 0.98, and root-mean-square errors (RMSEs) of 9.20, 4.15, and 2.77 µg m-3 on the daily, monthly, and annual bases, respectively.
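
    For reference, the reported CV-R2 and RMSE values correspond to the usual validation metrics; a minimal scikit-learn sketch (the arrays below are placeholder values, not taken from the dataset) is:

    ```python
    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    # Placeholder arrays: withheld ground-level PM2.5 measurements and model estimates.
    y_true = np.array([12.0, 35.5, 8.2, 60.1])
    y_pred = np.array([11.4, 33.0, 9.0, 57.8])

    cv_r2 = r2_score(y_true, y_pred)                        # coefficient of determination
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))      # root-mean-square error (µg/m3)
    print(f"CV-R2 = {cv_r2:.2f}, RMSE = {rmse:.2f}")
    ```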

    Due to data volume limitations,

    all (including daily) data for the year 2022 is accessible at: GlobalHighPM2.5 (2022)

    all (including daily) data for the year 2021 is accessible at: GlobalHighPM2.5 (2021)

    all (including daily) data for the year 2020 is accessible at: GlobalHighPM2.5 (2020)

    all (including daily) data for the year 2019 is accessible at: GlobalHighPM2.5 (2019)

    all (including daily) data for the year 2018 is accessible at: GlobalHighPM2.5 (2018)

    all (including daily) data for the year 2017 is accessible at: GlobalHighPM2.5 (2017)

    Continuously updated...

    More air quality datasets of different air pollutants can be found at: https://weijing-rs.github.io/product.html

  7. Data from: Data-Driven First-Principles Methods for the Study and Design of...

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1more
    txt
    Updated Jun 10, 2023
    Cite
    Zhi Deng; Zhuoying Zhu; Iek-Heng Chu; Shyue Ping Ong (2023). Data-Driven First-Principles Methods for the Study and Design of Alkali Superionic Conductors [Dataset]. http://doi.org/10.1021/acs.chemmater.6b02648.s002
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    ACS Publications
    Authors
    Zhi Deng; Zhuoying Zhu; Iek-Heng Chu; Shyue Ping Ong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We present a detailed exposition of how first-principles methods can be used to guide alkali superionic conductor (ASIC) study and design. Using the argyrodite Li6PS5Cl as a case study, we demonstrate how modern information technology (IT) infrastructure and software tools can facilitate the assessment of alkali superionic conductors in terms of various critical properties of interest such as phase and electrochemical stability and ionic conductivity. The emphasis is on well-documented, reproducible analysis code that can be readily generalized to other material systems and design problems. For our chosen Li6PS5Cl case study material, we show that Li excess is crucial to enhancing its conductivity by increasing the occupancy of interstitial sites that promote long-range Li+ diffusion between cage-like frameworks. The predicted room-temperature conductivities and activation barriers are in reasonably good agreement with experimental values.

  8. Fitabase data Google Certificate Capstone Project

    • kaggle.com
    zip
    Updated Feb 18, 2023
    Cite
    Kalyani Divakar (2023). Fitabase data Google Certificate Capstone Project [Dataset]. https://www.kaggle.com/datasets/kalyanidivakar/fitabase-data-google-certificate-capstone-project/code
    Explore at:
    zip (419665 bytes; available download formats)
    Dataset updated
    Feb 18, 2023
    Authors
    Kalyani Divakar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Case Study: How Can a Wellness Technology Company Play It Smart?

    This is my first case study as a data analyst, using Excel, Tableau, and R. This case study is part of my Google Data Analytics Professional Certification. Some insights may be presented differently, or may not be covered, from the reader's point of view; feedback from readers will be appreciated.

    Scenario: the Bellabeat data analysis case study! In this case study, you perform the real-world tasks of a junior data analyst. Bellabeat is a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the global smart device market. Urška Sršen, co-founder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat's products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide the company's marketing strategy, and you present the analysis to the Bellabeat executive team along with your high-level recommendations.

    The case study roadmap follows the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

    Ask: Sršen asks you to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices. She then wants you to select one Bellabeat product to apply these insights to in your presentation. These questions will guide your analysis: 1. What are some trends in smart device usage? 2. How could these trends apply to Bellabeat customers? 3. How could these trends help influence Bellabeat's marketing strategy? The report should include the following deliverables: 1. A clear summary of the business task 2. A description of all data sources used 3. Documentation of any cleaning or manipulation of data 4. A summary of your analysis 5. Supporting visualizations and key findings 6. Your top high-level content recommendations based on your analysis.

    Prepare: covers the dataset used, accessibility and privacy of data, information about the dataset, data organization and verification, and data credibility and integrity. The dataset used for the analysis is from Kaggle, which is considered a reliable source. Sršen encourages the use of public data that explores smart device users' daily habits and points to a specific data set: Fitbit Fitness Tracker Data (CC0: Public Domain, made available through Mobius). This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users' habits. Sršen notes that this data set might have some limitations and encourages adding other data to help address them. However, this analysis is confined primarily to the present dataset; additional data have not yet been added to address those limitations. I may take that up later, depending on the availability of similar product datasets, which may require a subscription or further searching; that is the limitation of confining my analysis to this dataset only.

    Process phase: 1. Tools used for analysis: Excel, Tableau, RStudio, Kaggle. 2. Cleaning of data: includes removal of duplicate records; by its nature the data contains repeated Ids and dates, and also zero values, which may be inherent to how activity was recorded or due to reasons yet unknown, so the analysis was done on the data as available (for a live project this would be discussed with the data owners). 3. Analysis was done based on the available variables.

    Analyze phase (per-Id averages, as provided):

    Id          Avg.VeryActiveDistance  Avg.ModerateActiveDistance  Avg.LightActiveDistance  TotalDistance  Avg.Calories
    1927972279  0.09580645              0.031290323                 0.050709677
    2026352035  0.006129032             0.011290322                 3.43612904
    3977333714  1.614999982             2.75099979                  3.134333344
    8053475328  8.514838742             0.423870965                 2.533870955
    8877689391  6.637419362             0.337741935                 6.188709674              3420.258065    409.5...
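
    The per-Id averages in the table above could, for example, be reproduced in pandas from the standard Fitbit daily activity sheet. The file and column names below are assumptions based on the common Fitabase export, not confirmed by this listing:

    ```python
    import pandas as pd

    # Assumed Fitabase daily activity export; column names are illustrative.
    daily = pd.read_csv("dailyActivity_merged.csv")

    summary = (daily.groupby("Id")
                    .agg(Avg_VeryActiveDistance=("VeryActiveDistance", "mean"),
                         Avg_ModeratelyActiveDistance=("ModeratelyActiveDistance", "mean"),
                         Avg_LightActiveDistance=("LightActiveDistance", "mean"),
                         Avg_TotalDistance=("TotalDistance", "mean"),
                         Avg_Calories=("Calories", "mean"))
                    .reset_index())
    print(summary.head())
    ```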

  9. Exploring Human-Centered Learning Analytics / Artificial Intelligence Tools...

    • portaldelaciencia.uva.es
    • zenodo.org
    Updated 2025
    Cite
    Jesús Brezmes Gil-Albarellos; Ortega-Arranz, Alejandro; Rodríguez Triana, María Jesús (2025). Exploring Human-Centered Learning Analytics / Artificial Intelligence Tools for Educational Purposes [Dataset]. https://portaldelaciencia.uva.es/documentos/688b604717bb6239d2d4aa58
    Explore at:
    Dataset updated
    2025
    Authors
    Jesús Brezmes Gil-Albarellos; Ortega-Arranz, Alejandro; Rodríguez Triana, María Jesús
    Description

    This dissertation builds upon a systematic literature review conducted by my supervisors (Topali, P., Ortega-Arranz, A., Rodríguez-Triana, M. J., et al. Designing human-centered learning analytics and artificial intelligence in education solutions: a systematic literature review. 2024), which focused on empirical studies involving the design, development, and implementation of human-centered learning analytics and artificial intelligence tools in education. From the articles included in that review, a subset of relevant tools was identified through an initial screening process. These tools were then analyzed to better understand their features and provide stakeholders with a structured overview of available systems that align with human-centered principles in Learning Analytics.

    The methodology adopted for this work was inspired by DESMET, enabling a systematic and structured evaluation of tool characteristics. The process involved iterative reading of associated literature, hands-on exploration of the tools, and direct communication with the authors or developers to validate the extracted information. A deductive, top-down classification approach was initially used to define broad categories covering both qualitative and quantitative aspects of the tools. These categories were progressively refined through multiple iterations, following an inductive, bottom-up strategy to enhance internal coherence and thematic clarity.

    Two rounds of outreach were conducted to validate the collected data, with several authors providing valuable feedback that was incorporated into the final analysis. The outcome is a categorised, feature-based overview of HCLA and HCAI tools, offering practical insights for researchers and practitioners seeking to adopt or further explore human-centered approaches in educational technology.

    Keywords: Human-Centered Learning Analytics, Human-Centered Artificial Intelligence, Learning Analytics, Human-Centered Design, DESMET, Systematic Analysis, Feature Analysis

    We defined a number of categories that capture the most important characteristics of each tool. For clarity, we organised the categories into five main groups:

    • Tool Basis: includes fundamental information about the tool, such as its name, purpose, pedagogical context, and technological aspects.
    • Human-Centered Approach: focuses on how stakeholders (e.g., students, teachers, researchers) are involved in the tool's development, use, and feedback processes.
    • Data Management: covers aspects related to data collection, storage, processing, and privacy considerations.
    • Tool Evaluation: examines how the tool's effectiveness, usability, and impact are assessed.
    • Tool Adoption: explores the extent to which the tool is used, its sustainability, and its integration into educational settings.

    The data are organized into four groups of tabs, each formed by the 5 groups mentioned above, to facilitate the management and analysis of the information collected during the methodological process (20 tabs in total).
    1. The first 5 sheets contain the information collected directly by myself after completing the iterations described in the methodology; that is, the raw data.
    2. The next 5 are tabs with comments and feedback provided by the authors.
    3. The next 5 are tabs where the authors' contributions have been integrated into the original data mentioned in point 1, thus combining both sources.
    4. The last 5 contain normalized or standardized data prepared for the quantitative and qualitative analyses carried out in this study.

  10. Data from: Accommodating the role of site memory in dynamic species...

    • data.niaid.nih.gov
    • search.dataone.org
    • +3more
    zip
    Updated May 3, 2021
    Cite
    Graziella DiRenzo; David Miller; Blake Hossack; Brent Sigafus; Paige Howell; Erin Muths; Evan Grant (2021). Accommodating the role of site memory in dynamic species distribution models [Dataset]. http://doi.org/10.5061/dryad.vdncjsxs7
    Explore at:
    zip (available download formats)
    Dataset updated
    May 3, 2021
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Pennsylvania State University
    Authors
    Graziella DiRenzo; David Miller; Blake Hossack; Brent Sigafus; Paige Howell; Erin Muths; Evan Grant
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    First-order dynamic occupancy models (FODOMs) are a class of state-space model in which the true state (occurrence) is observed imperfectly. An important assumption of FODOMs is that site dynamics only depend on the current state and that variations in dynamic processes are adequately captured with covariates or random effects. However, it is often difficult to understand and/or measure the covariates that generate ecological data, which are typically spatio-temporally correlated. Consequently, the non-independent error structure of correlated data causes underestimation of parameter uncertainty and poor ecological inference. Here, we extend the FODOM framework with a second-order Markov process to accommodate site memory when covariates are not available. Our modeling framework can be used to make reliable inference about site occupancy, colonization, extinction, turnover, and detection probabilities. We present a series of simulations to illustrate the data requirements and model performance. We then applied our modeling framework to 13 years of data from an amphibian community in southern Arizona, USA. In this analysis, we found residual temporal autocorrelation of population processes for most species, even after accounting for long-term drought dynamics. Our approach represents a valuable advance in obtaining inference on population dynamics, especially as they relate to metapopulations.

    Methods

    These files were written by: G. V. DiRenzo

    If you have any questions, please email: grace.direnzo@gmail.com

    This repository provides the code, data, and simulations to recreate all of the analysis, tables, and figures presented in the manuscript.

    In this file, we direct the user to the location of files.

    All methods can be found in the manuscript and associated supplements.

    All file paths direct the user in navigating the files in this repo.

    ######## Objective & Table of contents

    File objectives & Table of contents:

    # 1. To navigate to files explaining how to simulate and analyze data using the main text parameterization
    # 2. To navigate to files explaining how to simulate and analyze data using the alternative parameterization (hidden Markov model)
    # 3. To navigate to files that created the parameter combinations for the simulation studies
    # 4. To navigate to files used to run scenarios in the manuscript
      # 4a. Scenario 1: data generated without site memory & without site heterogeneity
      # 4b. Scenario 2: data generated with site memory & without site heterogeneity
      # 4c. Scenario 3: data generated with site memory & with site heterogeneity
    # 5. To navigate to files for general sample design guidelines
    # 6. Parameter accuracy, precision, and bias under different parameter combinations
    # 7. Model comparison under different scenarios
    # 8. To specifically navigate to code that recreates manuscript:
      # 8a. Figures
      # 8b. Tables
    # 9. To navigate to files for empirical analysis
    
    ### 1. Main text parameterization

    To see model parameterization as written in the main text, please navigate to: /MemModel/OtherCode/MemoryMod_main.R

    ### 2. Alternative parameterization

    To see alternative parameterization using a Hidden Markov Model, please navigate to: /MemModel/OtherCode/MemoryMod_HMM.R

    ### 3. Parameter Combinations

    To see how parameter combinations were generated, please navigate to: /MemModel/ParameterCombinations/LHS_parameter_combos.R

    To see stored parameter combinations for simulations, please navigate to: /MemModel/ParameterCombinations/parameter_combos_MemModel4.csv

    ### 4a. Scenario #1

    To simulate data WITHOUT memory and analyze using: - memory model & - first-order dynamic occupancy model

    Please navigate to: /MemModel/Simulations/withoutMem/Code/
    • MemoryMod_JobArray_withoutMem.R = code to simulate & analyze data
    • MemoryMod_JA1.sh = file to run simulations 1-5000 on HPC
    • MemoryMod_JA2.sh = file to run simulations 5001-10000 on HPC

    All model output is stored in: /MemModel/Simulations/withoutMem/ModelOutput

    ### 4b. Scenario #2

    To simulate data WITH memory and analyze using: - memory model & - first-order dynamic occupancy model

    Please navigate to: /MemModel/Simulations/withMem/Code/
    • MemoryMod_JobArray_withMem.R = code to simulate & analyze data
    • MemoryMod_JA1.sh = file to run simulations 1-5000 on HPC
    • MemoryMod_JA2.sh = file to run simulations 5001-10000 on HPC

    All model output is stored in: /MemModel/Simulations/withMem/ModelOutput

    ### 4c. Scenario #3

    To simulate data WITH memory and WITH site heterogeneity, and analyze using: - memory model & - first-order dynamic occupancy model

    Please navigate to: /MemModel/Simulations/Hetero/Code/
    • MemoryMod_JobArray_Hetero.R = code to simulate & analyze data
    • MemoryMod_JA1.sh = file to run simulations 1-5000 on HPC
    • MemoryMod_JA2.sh = file to run simulations 5001-10000 on HPC

    All model output is stored in: /MemModel/Simulations/Hetero/ModelOutput

    ### 5. General sample design guidelines

    To see methods for the general sample design guidelines, please navigate to: /MemModel/PostProcessingCode/Sampling_design_guidelines.R

    ### 6. Parameter accuracy, precision, and bias under different parameter combinations

    To see methods for model performance under different parameter combinations, please navigate to: /MemModel/PostProcessingCode/Parameter_precison_accuracy_bias.R

    ### 7. Comparison of model performance

    To see methods for model comparison, please navigate to: /MemModel/PostProcessingCode/ModelComparison.R

    ### 8a. Manuscript Figures

    To create parts of Figure 1 of main text (case study): - Fig 1D & 1E: /MemModel/EmpiricalAnalysis/Code/Analysis/AZ_CaseStudy.R

    To create Figure 2 of main text (Comparison across simulation scenarios): - /MemModel/PostProcessingCode/ModelComparison.R

    To create Figure S1, S2, & S3 use file: - /MemModel/PostProcessingCode/Parameter_precison_accuracy_bias.R

    To create Figure S4 & S5 use file: - /MemModel/PostProcessingCode/ModelComparison.R

    ### 8b. Manuscript Tables

    To create Table 1 of main text (General sampling recommendations): - /MemModel/PostProcessingCode/Sampling_design_guidelines.R

    To create Table S1: - /MemModel/PostProcessingCode/Parameter_precison_accuracy_bias.R

    To create Table S2: - /MemModel/EmpiricalAnalysis/Code/Analysis/AZ_CaseStudy.R

    To create Table S3: - /MemModel/PostProcessingCode/ModelComparison.R

    To create Table S4 & S5: - /MemModel/EmpiricalAnalysis/Code/Analysis/AZ_CaseStudy.R

    ### 9. Empirical analysis

    To recreate the empirical analysis of the case study, please navigate to: - /MemModel/EmpiricalAnalysis/Code/Analysis/AZ_CaseStudy.R

  11. Smarter open government data for Society 5.0: analysis of 51 OGD portals

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Aug 4, 2021
    Cite
    Anastasija Nikiforova (2021). Smarter open government data for Society 5.0: analysis of 51 OGD portals [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_5142244
    Explore at:
    Dataset updated
    Aug 4, 2021
    Dataset provided by
    University of Latvia
    Authors
    Anastasija Nikiforova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during the study "Smarter open government data for Society 5.0: are your open data smart enough" (Sensors. 2021; 21(15):5204) conducted by Anastasija Nikiforova (University of Latvia). It is being made public both to act as supplementary data for the "Smarter open government data for Society 5.0: are your open data smart enough" paper and to allow other researchers to use these data in their own work.

    The data in this dataset were collected by inspecting 60 countries and their OGD portals (a total of 51 OGD portals in May 2021) to find out whether they meet the trends of Society 5.0 and Industry 4.0, based on an analysis of the relevant OGD portals.

    Each portal was studied starting with a search for data sets of interest, i.e. “real-time”, “sensor” and “covid-19”, followed by a list of additional questions. These questions were formulated on the basis of a combination of (1) crucial open (government) data-related aspects, including open data principles, success factors, recent studies on the topic, the PSI Directive, etc., (2) trends and features of Society 5.0 and Industry 4.0, and (3) elements of the Technology Acceptance Model (TAM) and the Unified Theory of Acceptance and Use of Technology (UTAUT).

    The method used belongs to the typical / daily tasks of open data portal use, sometimes called a “usability test”: keywords related to a research question are used to filter data sets, i.e. “real-time”, “real time”, “sensor”, “covid”, “covid-19”, “corona”, “coronavirus”, “virus”. In most cases, the “real-time”, “sensor” and “covid” keywords were sufficient (a rough keyword-filtering sketch follows the list below). The examination of the respective aspects for less user-friendly portals was adapted to the particular case based on the portal or data set specifics, by checking:
    1. are the open data related to the topic under question ({sensor; real-time; Covid-19}) published, i.e. available?
    2. are these data available in a machine-readable format?
    3. are these data current, i.e. regularly updated? The criteria for currency depend on the nature of the data, i.e. Covid-19 data on the number of cases per day are expected to be updated daily, which would not be sufficient for real-time data as the title supposes, etc.
    4. is an API provided for these data? This is most important for real-time and sensor data.
    5. have they been published in a timely manner? This was verified mainly for Covid-19-related data; timeliness is assessed by comparing the date of the first case identified in a given country and the first release of open data on this topic.
    6. what is the total number of available data sets?
    7. does the open government data portal provide use-cases / showcases?
    8. does the open government portal provide an opportunity to gain insight into the popularity of the data, i.e. does the portal provide statistics of this nature, such as the number of views, downloads, reuses, rating, etc.?
    9. is there an opportunity to provide feedback, a comment, a suggestion or a complaint?
    10. (9a) is the artifact, i.e. the feedback, comment, suggestion or complaint, visible to other users?
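
    A rough sketch of the keyword-based filtering step described above (the catalogue file and column name are hypothetical, used only for illustration):

    ```python
    import pandas as pd

    # Hypothetical export of a portal's catalogue: one row per published data set.
    catalogue = pd.read_csv("portal_dataset_titles.csv")

    keywords = ["real-time", "real time", "sensor", "covid"]
    pattern = "|".join(keywords)

    # Case-insensitive keyword match on the (assumed) title column.
    hits = catalogue[catalogue["title"].str.contains(pattern, case=False, na=False)]
    print(f"{len(hits)} of {len(catalogue)} data sets match the keywords")
    ```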

    Format of the file: .xls, .ods, .csv (for the first spreadsheet only)

    Licenses or restrictions: CC-BY

    For more info, see README.txt

  12. Data for: Clustering analysis of countries using the COVID-19 cases dataset

    • data.mendeley.com
    Updated Apr 26, 2021
    + more versions
    Cite
    Efthimios Zervas (2021). Data for: Clustering analysis of countries using the COVID-19 cases dataset [Dataset]. http://doi.org/10.17632/ffrk66tnvf.1
    Explore at:
    Dataset updated
    Apr 26, 2021
    Authors
    Efthimios Zervas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cases of COVID-19. Figure 1: per country (per day after the first case and per date); Figure 2: clustering of Figure 1 data; Figure 3: per country/population (per day after the first case and per date); Figure 4: clustering of Figure 3 data; Figure 5: per country/population/land area (per day after the first case and per date); Figure 6: clustering of Figure 5 data.

  13. BISE Dataset-Balinese Script for Imaginary Spelling using...

    • data.mendeley.com
    Updated Nov 15, 2024
    + more versions
    Cite
    I Made Agus Wirawan (2024). BISE Dataset-Balinese Script for Imaginary Spelling using Electroencephalogram Signals [Dataset]. http://doi.org/10.17632/c3m4s2dtcr.2
    Explore at:
    Dataset updated
    Nov 15, 2024
    Authors
    I Made Agus Wirawan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Balinese Script for Imaginary Spelling using Electroencephalogram (the BISE dataset) is a collection of data related to the pronunciation/spelling and imagination of Balinese script, based on electroencephalogram (EEG) signals. This dataset consists of character spelling (CS) and character imagination (CI) datasets. Providing both CS and CI datasets is very important to ensure that the EEG signal pattern for the spoken script matches the imagined script, so both genuinely needed to be provided. In addition, previous researchers had yet to collect a Balinese script imagination dataset. The participants sampled in this study were 31 healthy people, 8 males and 23 females, drawn from students of the Balinese language education study program at Universitas Pendidikan Ganesha. The participants' EEG signals were recorded using a Contec KT 88 with 16 channels. This dataset consists of 7 types of data, namely: (1) raw data from the 1st experiment, (2) raw data from the 2nd experiment, (3) data analysis of character spelling (CS) in the 1st experiment, (4) data analysis of character imagination (CI) in the 1st experiment, (5) data analysis of character spelling (CS) in the 2nd experiment, (6) data analysis of character imagination (CI) in the 2nd experiment, and (7) raw data from calm conditions. The first experiment's raw data contain EEG signals from participants pronouncing and imagining 18 Balinese scripts, sequentially and randomly. The second experiment's raw data contain EEG signals from participants spelling (CS) and imagining (CI) 6 Balinese vowel scripts, sequentially and randomly. From the raw data, a data analysis process was then carried out for both experiments: the first experiment's raw data yielded two analysis datasets, for the 18 spelled and the 18 imagined Balinese scripts, and the second experiment's raw data yielded two analysis datasets, for the 6 spelled and the 6 imagined Balinese vowel scripts. Finally, the calm-condition raw data contain EEG signals from participants in a quiet state before starting the experiment.

  14. Udacity bikeshare Project Solution

    • kaggle.com
    zip
    Updated Aug 24, 2021
    Cite
    Abeer Elhussein (2021). Udacity bikeshare Project Solution [Dataset]. https://www.kaggle.com/abeerelhussein/udacity-bikeshare-project-solution
    Explore at:
    zip (24173698 bytes; available download formats)
    Dataset updated
    Aug 24, 2021
    Authors
    Abeer Elhussein
    Description

    The dataset is a Python script file and 3 Excel sheets. I used pandas and Python to analyze the data by asking the user some questions: which city they would like to see results for, then choosing a month and a day, or choosing 'all' to get results for all months. Some results are then shown, and finally the user is asked whether they want to see 5 rows of the data. I would appreciate a review and evaluation of my first project, thanks.
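
    A minimal sketch of the interactive flow described above (the file names and column names are assumptions based on the typical Udacity bikeshare data, not confirmed by this listing):

    ```python
    import pandas as pd

    CITY_DATA = {"chicago": "chicago.csv",
                 "new york city": "new_york_city.csv",
                 "washington": "washington.csv"}  # assumed file names

    def load_filtered_data():
        city = input("Which city would you like to see results for? ").strip().lower()
        month = input("Choose a month (e.g. 'january') or 'all': ").strip().lower()
        day = input("Choose a day (e.g. 'monday') or 'all': ").strip().lower()

        df = pd.read_csv(CITY_DATA[city])
        df["Start Time"] = pd.to_datetime(df["Start Time"])   # assumed column name
        if month != "all":
            df = df[df["Start Time"].dt.month_name().str.lower() == month]
        if day != "all":
            df = df[df["Start Time"].dt.day_name().str.lower() == day]
        return df

    def show_raw_data(df):
        # Offer the user 5 rows at a time, as described in the project brief.
        start = 0
        while input("Would you like to see 5 rows of data? (yes/no) ").lower() == "yes":
            print(df.iloc[start:start + 5])
            start += 5

    if __name__ == "__main__":
        data = load_filtered_data()
        show_raw_data(data)
    ```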

  15. Michigan Indoor Radon Results

    • hub.arcgis.com
    Updated Mar 25, 2025
    Cite
    Michigan Dept. of Environment, Great Lakes, and Energy (2025). Michigan Indoor Radon Results [Dataset]. https://hub.arcgis.com/datasets/7320389d72d241399c04424f62e214a5
    Explore at:
    Dataset updated
    Mar 25, 2025
    Dataset authored and provided by
    Michigan Dept. of Environment, Great Lakes, and Energy
    Area covered
    Michigan,
    Description

    EGLE’s Michigan Indoor Radon Map was created with data supplied to the state by organization(s) that manufacture and/or analyze short-term radon test kits. This tool is intended to help citizens understand radon concentrations in a given zip code and how indoor radon levels are geologically dependent. It also assists the EGLE radon outreach specialist in identifying areas with low/no testing rates and areas with higher likelihoods of increased indoor radon presence. Data (indoor radon concentration) is collected from organization(s) willing to provide information, which is then filtered to include values associated with first-time tests and without active radon mitigation systems. Full analytical data will not be displayed for zip codes with fewer than 5 test results; only the test kit count will be shown. A zero value indicates insufficient data for analysis and a null value indicates no tests were provided. Statistical analysis has only been done on first-time radon tests. It should be noted that in several zip codes the total number of tests includes first-time tests, follow-up tests, post-repair tests, testing for real estate, or unknown testing. Email radon@michigan.gov for any questions relating to the data.

    The dataset includes results for approximately the preceding 10 years. Data will be refreshed at a rate no more frequently than monthly, but at least quarterly. A full new dataset will be provided for each transmittal. Download the data behind this map on EGLE's Open Data Portal. View the Web App.

    Fields:
    • ZipCode: ZIP Code Number
    • POName: Post Office Name
    • MinimumRadonLevel: Minimum value of 1st time tests*
    • MaximumRadonLevel: Maximum value of 1st time tests*
    • TotalTests: Count of total number of radon tests done*
    • TotalFirstTimeTests: Count of total number of 1st time radon tests done*
    • AverageRadonLevel: Mean value of 1st time tests*. Shown on the map as: < 2 pCi/L (Retest every 2-5 years); ≥ 2 - 3.9 pCi/L (Mitigation Suggested); ≥ 4 pCi/L (Mitigation Recommended)
    • MedianRadonLevel: Median value of 1st time tests*
    • TestsGreaterThan2pCi: Count of 1st time tests greater than 2 pCi/L*
    • TestsGreaterThan4pCi: Count of 1st time tests greater than 4 pCi/L*
    • PercentTestsGreaterThan2pCi: Number of 1st time tests above 2 pCi/L divided by the total number of 1st time tests*
    • PercentTestsGreaterThan4pCi: Number of 1st time tests above 4 pCi/L divided by the total number of 1st time tests*
    • LastUpdatedDate: Date of last update. Data does not always change with each update.

    *IMPORTANT NOTE: Full analytical data will not be displayed for zip codes with fewer than 5 test results; only the test kit count will be shown. A zero value indicates insufficient data for analysis and a null value indicates no tests were provided.

    Disclaimer: EGLE makes every attempt to ensure data accuracy but cannot guarantee the completeness or accuracy of the information contained within these datasets. The purpose of this map is to assist residents with understanding radon risk potential. This map is not intended to be used to determine if a home in a given area should be tested for radon. Homes with elevated levels of radon have been found in all 83 counties. All homes should be tested regardless of geographic location. ZIP code boundaries do not correspond to legal or administrative boundaries and were originally created for postal delivery purposes. ZIP code boundaries are subject to change over time and are not precise due to the linear nature of postal routes. This dynamic nature makes ZIP codes less reliable for long-term geographical analysis. The USPS does not offer an authoritative layer for use by mapping professionals. The ZIP code boundaries used in this analysis are sourced from ESRI.
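
    As a rough illustration of how the per-ZIP summary fields listed above could be derived from raw first-time test results, a minimal pandas sketch (the input file and column names are hypothetical):

    ```python
    import pandas as pd

    # Hypothetical raw file: one row per first-time radon test, with ZIP code and result in pCi/L.
    tests = pd.read_csv("first_time_radon_tests.csv")   # columns assumed: ZipCode, RadonLevel

    summary = tests.groupby("ZipCode")["RadonLevel"].agg(
        MinimumRadonLevel="min",
        MaximumRadonLevel="max",
        AverageRadonLevel="mean",
        MedianRadonLevel="median",
        TotalFirstTimeTests="count",
    )
    summary["TestsGreaterThan2pCi"] = (
        tests[tests["RadonLevel"] > 2].groupby("ZipCode")["RadonLevel"].count()
    )
    summary["PercentTestsGreaterThan2pCi"] = (
        summary["TestsGreaterThan2pCi"].fillna(0) / summary["TotalFirstTimeTests"] * 100
    )
    # Suppress full statistics for ZIP codes with fewer than 5 first-time tests, as described above.
    summary.loc[summary["TotalFirstTimeTests"] < 5,
                ["MinimumRadonLevel", "MaximumRadonLevel", "AverageRadonLevel", "MedianRadonLevel"]] = None
    print(summary.head())
    ```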

  16. Predictive Employee Attrition Analysis (IBM HR)

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Cite
    Nicolas Zalazar (2025). Predictive Employee Attrition Analysis (IBM HR) [Dataset]. https://www.kaggle.com/datasets/nicolaszalazar73/ibm-hr-analytics-predictive-employee-attrition
    Explore at:
    zip (101327 bytes; available download formats)
    Dataset updated
    Nov 3, 2025
    Authors
    Nicolas Zalazar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Predictive Employee Attrition Analysis (IBM HR Churn)

    Context and Dataset Overview

    This project focuses on the IBM HR Analytics Employee Attrition & Performance dataset, a classic resource for exploring predictive modeling in Human Resources. The goal is to move beyond simple descriptive statistics to build a functional predictive model that identifies which employees are at risk of leaving and why.

    The dataset includes demographic information, job roles, satisfaction levels, salary metrics, and key tenure data. The core challenge is addressing the high class imbalance and extracting actionable business intelligence from the model coefficients.

    Methodology and Core Contribution

    This analysis follows a robust data science pipeline, providing a complete solution from raw data to executive visualization:

    Data Preprocessing & Feature Engineering: We utilized Python (Pandas/Scikit-learn) to clean the data, impute missing values, and engineer crucial categorical variables (e.g., Seniority_Category, Monthly_Income_Level) to enhance segmentation analysis.

    Predictive Modeling (Logistic Regression): A Logistic Regression model was trained to predict the binary target (Attrition). Crucially, we use the model coefficients as "Churn Drivers" to quantify the influence of each variable on the probability of attrition.

    Executive Visualization: The findings are presented in a comprehensive two-page Looker Studio Dashboard. The dashboard is designed to be actionable, clearly separating the "Why" (Predictive Drivers) from the "Who and Where" (Descriptive Risk Segments).

    Key Project Outcomes

    Identification of Overtime as the single highest predictor of employee attrition.

    Confirmation of high early-career churn (within the first 5 years of tenure).

    A clear, validated framework for HR teams to prioritize intervention based on quantitative risk factors.

    Technologies Used: Python (Pandas, Scikit-learn), SQL, Looker Studio.

  17. Replication Data for: "Social Entrepreneurship Measurement Framework for...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Vieira, Valéria Gonçalves; Oliveira, Verônica Macário de; Miki, Adriana Fumi Chim (2023). Replication Data for: "Social Entrepreneurship Measurement Framework for Developing Countries" published by RAC-Revista de Administração Contemporânea [Dataset]. http://doi.org/10.7910/DVN/UGOMJN
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Vieira, Valéria Gonçalves; Oliveira, Verônica Macário de; Miki, Adriana Fumi Chim
    Description

    This study aims to propose and validate with experts a framework of elements for measuring social entrepreneurship in developing countries. The proposed model was designed based on a literature review of entrepreneurship models indexed in the Web of Science and Scopus databases. The dimensions associated with social entrepreneurship and their potential analysis categories were identified, composing a preliminary framework of indicators validated by a panel of experts using the Delphi technique. The analysis tool was a survey covering four dimensions and their sub-dimensions, resulting in 59 variables. It was available in Portuguese and English, which allowed international participation, and was sent to the respondents via e-mail. A Likert-type scale was adopted, ranging from 1 to 7, where 1 is the least important and 7 the most important for an indicator. At the end of each group of questions, an open-ended question was included for suggestions and comments. The Delphi method was implemented in two rounds, and the inclusion criterion was that at least 80% of responses were equal to or higher than 5. After the first round of data analysis, the indicators were submitted to a second round. Initially, the indicators with consensus equal to or greater than 80% were evaluated, then those that did not reach consensus in the first round; in both cases, the specialists were asked to decide whether the indicator should be included or excluded. The analysis of the responses from the second round used the same level of consensus as the first round (80%) for both inclusion and exclusion of an item in the model. After two rounds of Delphi questionnaires, it was possible to identify the most important indicators for the intended evaluation. Therefore, 46 out of 59 (77.97%) initially proposed indicators were retained to explain social entrepreneurship in developing countries. The model includes elements of entrepreneurship measurement at the individual and organizational levels, composing four dimensions: Social Entrepreneurial Intention, Social Entrepreneurial Orientation, Processes, and Outcomes. It recognizes that social entrepreneurship in developing countries depends on the social context, which is reflected in the willingness to solve society's problems, generating not only economic value but also social and environmental value as a result.
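    To make the consensus rule concrete, a small worked sketch with two hypothetical indicator names and invented expert ratings, applying the 80%-of-responses-at-least-5 criterion described above:

    ```python
    import pandas as pd

    # Hypothetical Likert ratings (1-7): rows are experts, columns are indicators.
    ratings = pd.DataFrame({
        "social_mission_clarity": [7, 6, 5, 7, 4, 6, 5, 7, 6, 5],
        "profit_reinvestment":    [3, 5, 4, 6, 4, 5, 3, 4, 5, 4],
    })

    # Consensus rule from the study: an indicator is retained when at least 80%
    # of responses are equal to or higher than 5.
    share_at_least_5 = (ratings >= 5).mean()
    retained = share_at_least_5[share_at_least_5 >= 0.80].index.tolist()

    print(share_at_least_5)  # 0.9 and 0.4 for the two illustrative indicators
    print(retained)          # ['social_mission_clarity']
    ```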

  18. Data from: Grayanane Diterpenoids from the Leaves of Rhododendron...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Jun 1, 2023
    Cite
    Na Sun; Guijuan Zheng; Meijun He; Yuanyuan Feng; Junjun Liu; Meicheng Wang; Hanqi Zhang; Junfei Zhou; Guangmin Yao (2023). Grayanane Diterpenoids from the Leaves of Rhododendron auriculatum and Their Analgesic Activities [Dataset]. http://doi.org/10.1021/acs.jnatprod.9b00095.s002
    Explore at:
    txt. Available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Na Sun; Guijuan Zheng; Meijun He; Yuanyuan Feng; Junjun Liu; Meicheng Wang; Hanqi Zhang; Junfei Zhou; Guangmin Yao
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Twenty-four grayanane diterpenoids (1–24) including 12 new ones (1–12) were isolated from Rhododendron auriculatum. The structures of the new grayanane diterpenoids (1–12) were defined via extensive spectroscopic data analysis. The absolute configurations of compounds 2–4, 10–12, 14, and 16 were established by single-crystal X-ray diffraction analysis, and electronic circular dichroism data were used to define the absolute configurations of auriculatols D (8) and E (9). Auriculatol A (1) is the first example of a 5,20-epoxygrayanane diterpenoid bearing a 7-oxabicyclo[4.2.1]nonane motif and a trans/cis/cis/cis-fused 5/5/7/6/5 pentacyclic ring system. Auriculatol B (2) is the first example of a 3α,5α-dihydroxy-1-βH-grayanane diterpenoid. 19-Hydroxy-3-epi-auriculatol B (6) and auriculatol C (7) represent the first examples of 19-hydroxygrayanane and grayan-5(6)-ene diterpenoids, respectively. Diterpenoids 1–24 showed analgesic activities in the writhing test induced by HOAc, and 2, 6, 10, 13, 19, and 24 at a dose of 5.0 mg/kg exhibited significant analgesic effects (inhibition rates >50%). Grayanane diterpenoids grayanotoxins I (19) and IV (24) at doses of 0.2 and 0.04 mg/kg showed more potent analgesic activities than morphine.

  19. Planck High-Redshift Source Candidates Catalog - Dataset - NASA Open Data...

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). Planck High-Redshift Source Candidates Catalog - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/planck-high-redshift-source-candidates-catalog
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Planck mission, thanks to its large frequency range and all-sky coverage, has a unique potential for systematically detecting the brightest, and rarest, sub-millimeter sources on the sky, including distant objects in the high-redshift Universe traced by their dust emission. A novel method, based on a component-separation procedure using a combination of Planck and IRAS data, has been validated and characterized on numerous simulations, and applied to select the most luminous cold sub-millimeter sources with spectral energy distributions peaking between 353 and 857 GHz at 5-arcminute resolution. A total of 2,151 Planck high-z source candidates (the PHZ list) have been detected in the cleanest 26% of the sky, with flux density at 545 GHz above 500 mJy. Embedded in the cosmic infrared background close to the confusion limit, these high-z candidates exhibit colder colors than their surroundings, consistent with redshifts z greater than 2, assuming a dust temperature of T_xgal = 35 K and a spectral index of β_xgal = 1.5. Exhibiting extremely high luminosities, larger than 10^14 Lsun, the PHZ objects may be made of multiple galaxies or clumps at high redshift, as suggested by a first statistical analysis based on a comparison with number count models. Furthermore, first follow-up observations obtained from optical to sub-millimeter wavelengths, which can be found in companion papers, have confirmed that this list consists of two distinct populations. A small fraction (around 3%) of the sources have been identified as strongly gravitationally lensed star-forming galaxies at redshift 2 to 4, while the vast majority of the PHZ sources appear as overdensities of dusty star-forming galaxies, having colors consistent with being at z > 2, and may be considered as proto-cluster candidates. The PHZ provides an original sample, which is complementary to the Planck Sunyaev-Zeldovich Catalog (PSZ2); by extending the population of virialized massive galaxy clusters detected at z < 1.5 through their SZ signal to a population of sources at z > 1.5, the PHZ may contain the progenitors of today's clusters. Hence the Planck list of high-redshift source candidates opens a new window on the study of the early stages of structure formation, particularly on understanding the intensively star-forming phase at high z.

    The compact source detection algorithm used herein requires positive detections simultaneously, within a 5-arcminute radius, in the 545-GHz excess map and in the 857-, 545-, and 353-GHz cleaned maps. It also requires a non-detection in the 100-GHz cleaned map, which traces emission from synchrotron sources. A detection is defined as a local maximum of the signal-to-noise ratio (S/N) above a given threshold in each map, with a spatial separation of at least 5 arcminutes required between two local maxima. A threshold of S/N > 5 is adopted for detections in the 545-GHz excess map, while this is slightly relaxed to S/N > 3 for detections in the cleaned maps, because the constraint imposed by the spatial consistency between detections in all three bands is expected to reinforce the robustness of a simultaneous detection. Concerning the 100-GHz band, the authors adopt a similar threshold by requiring the absence of any local maximum with S/N > 3 within a radius of 5 arcminutes.

    The HEASARC has changed the names of many of the parameters from those given in the original table. In such cases we have listed the original names in parentheses at the end of the parameter descriptions given below. This table was created by the HEASARC in May 2017 based upon the CDS Catalog J/A+A/596/A100 file phz.dat. This is a service provided by NASA HEASARC.
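    The selection thresholds quoted above can be summarized in a short sketch; it encodes only the per-band S/N cuts for a single candidate position and ignores the 5-arcminute spatial-matching step:

    ```python
    from dataclasses import dataclass

    @dataclass
    class CandidateSN:
        """Signal-to-noise ratios at one sky position (illustrative container)."""
        excess_545: float  # 545 GHz excess map
        clean_857: float   # cleaned maps
        clean_545: float
        clean_353: float
        clean_100: float   # synchrotron tracer

    def is_phz_detection(c: CandidateSN) -> bool:
        """Apply the selection thresholds quoted in the catalog description."""
        return (
            c.excess_545 > 5.0        # S/N > 5 in the 545 GHz excess map
            and c.clean_857 > 3.0     # S/N > 3 in each cleaned band
            and c.clean_545 > 3.0
            and c.clean_353 > 3.0
            and c.clean_100 <= 3.0    # no 100 GHz local maximum above S/N = 3
        )

    print(is_phz_detection(CandidateSN(6.2, 4.1, 5.0, 3.5, 1.2)))  # True
    print(is_phz_detection(CandidateSN(6.2, 4.1, 5.0, 3.5, 4.0)))  # False
    ```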

  20. Data from: Estimating transmission dynamics and serial interval of the first...

    • datadryad.org
    • data.niaid.nih.gov
    • +2more
    zip
    Updated Jun 30, 2020
    Cite
    Khouloud Talmoudi; Mouna Safer; Hejer Letaief; Aicha Hchaichi; Chahida Harizi; Sonia Dhaouadi; Sondes Derouiche; Ilhem Bouaziz; Donia Gharbi; Nourhene Najar; Molka Osman; Ines Cherif; Rym Mlallekh; Oumaima Ben-Ayed; Yosr Ayedi; Leila Bouabid; Souha Bougatef; Nissaf Bouafif Ben-Alaya; Mohamed Kouni Chahed (2020). Estimating transmission dynamics and serial interval of the first wave of COVID-19 infections under different control measures: A statistical analysis in Tunisia from February 29 to May 5, 2020 [Dataset]. http://doi.org/10.5061/dryad.b8gtht799
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 30, 2020
    Dataset provided by
    Dryad
    Authors
    Khouloud Talmoudi; Mouna Safer; Hejer Letaief; Aicha Hchaichi; Chahida Harizi; Sonia Dhaouadi; Sondes Derouiche; Ilhem Bouaziz; Donia Gharbi; Nourhene Najar; Molka Osman; Ines Cherif; Rym Mlallekh; Oumaima Ben-Ayed; Yosr Ayedi; Leila Bouabid; Souha Bougatef; Nissaf Bouafif Ben-Alaya; Mohamed Kouni Chahed
    Time period covered
    May 27, 2020
    Area covered
    Tunisia
    Description

    Background: Describing the transmission dynamics of an outbreak and the impact of intervention measures is critical to planning responses to future outbreaks and providing timely information to guide policy makers' decisions. We estimate the serial interval (SI) and temporal reproduction number (Rt) of SARS-CoV-2 in Tunisia.

    Methods: We collected investigation and contact-tracing data between March 1, 2020 and May 5, 2020, as well as illness onset data for the period February 29-May 5, 2020, from the National Observatory of New and Emerging Diseases of Tunisia. A maximum likelihood (ML) approach was used to estimate the dynamics of Rt.

    Results: A total of 491 infector-infectee pairs were included, with 14.46% reporting pre-symptomatic transmission. The SI follows a Gamma distribution with mean 5.30 days [95% CI 4.66-5.95] and standard deviation 0.26 [95% CI 0.23-0.30]. We also estimated large changes in Rt in response to the combined lockdown interventions. The Rt moves from 3.18 [95% CI 2.73-3.69] to 1.77 [95% CI 1...
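    As a small illustration of the Gamma parameterization quoted above, the following sketch moment-matches the reported mean and standard deviation to a shape/scale pair; treating 0.26 as the standard deviation of the serial interval distribution itself is an assumption, since the truncated abstract does not make this explicit:

    ```python
    from scipy import stats

    # Serial interval summary quoted in the abstract (days); purely illustrative.
    mean_si, sd_si = 5.30, 0.26

    # Moment matching for a Gamma(shape, scale) distribution:
    #   mean = shape * scale,  variance = shape * scale**2
    shape = (mean_si / sd_si) ** 2
    scale = sd_si ** 2 / mean_si
    si = stats.gamma(a=shape, scale=scale)

    print(f"shape={shape:.1f}, scale={scale:.4f}")
    print("P(serial interval > 7 days) =", 1 - si.cdf(7.0))
    ```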
