46 datasets found
  1. Synthetic HR Burnout Dataset

    • kaggle.com
    Updated Jun 3, 2025
    Cite
    Anvar Kamaleyev (2025). Synthetic HR Burnout Dataset [Dataset]. https://www.kaggle.com/datasets/ankam6010/synthetic-hr-burnout-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 3, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anvar Kamaleyev
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset simulates employee-level data for burnout prediction and classification tasks. It can be used for binary classification, exploratory data analysis (EDA), and feature importance exploration.

    📄 Columns

    Name — Synthetic employee name (for realism, not for ML use).

    Age — Age of the employee.

    Gender — Male or Female.

    JobRole — Job type (Engineer, HR, Manager, etc.).

    Experience — Years of work experience.

    WorkHoursPerWeek — Average number of working hours per week.

    RemoteRatio — % of time spent working remotely (0–100).

    SatisfactionLevel — Self-reported satisfaction (1.0 to 5.0).

    StressLevel — Self-reported stress level (1 to 10).

    Burnout — Target variable. 1 if signs of burnout exist (high stress + low satisfaction + long hours), otherwise 0.
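
    A minimal sketch of the binary classification task described above (column names are taken from the list; the CSV file name is an assumption based on the Kaggle slug):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("synthetic_hr_burnout.csv")  # assumed file name

    # Drop the synthetic Name column (for realism, not for ML use) and
    # one-hot encode the categorical features.
    X = pd.get_dummies(df.drop(columns=["Name", "Burnout"]),
                       columns=["Gender", "JobRole"])
    y = df["Burnout"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))

    # Feature importance exploration, as the description suggests.
    print(pd.Series(clf.feature_importances_, index=X.columns)
            .sort_values(ascending=False).head())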

  2. Corporate_work_hours_productivity

    • kaggle.com
    Updated Feb 13, 2025
    Cite
    SuryaDeepthi (2025). Corporate_work_hours_productivity [Dataset]. https://www.kaggle.com/datasets/suryadeepthi/corporate-work-hours-productivity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Kaggle
    Authors
    SuryaDeepthi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 10,000 records of corporate employees across various departments, focusing on work hours, job satisfaction, and productivity performance. The dataset is designed for exploratory data analysis (EDA), performance benchmarking, and predictive modeling of productivity trends.

    You can conduct EDA and investigate correlations between work hours, remote work, job satisfaction, and productivity. You can create new metrics, such as efficiency per hour or the impact of meetings on productivity. For a predictive task, you can use "Productivity_Score" as a regression target (predicting continuous performance scores), or you can frame a classification problem (e.g., categorizing employees into high, medium, or low productivity), as sketched below.
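
    A sketch of both framings, assuming the file name and that all columns other than "Productivity_Score" serve as features (only "Productivity_Score" is named in the description):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("corporate_work_hours_productivity.csv")  # assumed file name

    # Regression: predict the continuous performance score.
    X = pd.get_dummies(df.drop(columns=["Productivity_Score"]))
    y = df["Productivity_Score"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    reg = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out data:", reg.score(X_test, y_test))

    # Classification: bin the score into low/medium/high productivity.
    df["ProductivityBand"] = pd.qcut(df["Productivity_Score"], 3,
                                     labels=["low", "medium", "high"])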

  3. COSMOS: a dataset for Classification Of Stress and workload using multiMOdal...

    • zenodo.org
    Updated Sep 25, 2024
    Cite
    Christoph Anders; Sidratul Moontaha; Fabian Stolp; Samik Real; Bert Arnrich (2024). COSMOS: a dataset for Classification Of Stress and workload using multiMOdal wearable Sensors [Dataset]. http://doi.org/10.5281/zenodo.7923969
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Anders; Sidratul Moontaha; Fabian Stolp; Samik Real; Bert Arnrich
    Description

    Prolonged stress and high mental workload can have deteriorating long-term effects, contributing to the development of several stress-related diseases. Existing stress detection techniques are often uni-modal and limited to controlled setups. A single sensing modality can be unobtrusive but mostly results in unreliable sensor readings, especially in uncontrolled environments. Our study recorded multi-modal physiological signals from twenty-five participants in controlled and uncontrolled environments performing given and self-chosen tasks of high and low mental demand. In this version, we processed and published a subset of the dataset from six participants while working on the rest. The subset of the data is used to check the feasibility of our study by engineering features from electroencephalography (EEG), photoplethysmography (PPG), electrodermal activity (EDA), and temperature sensor data. Machine learning methods were used for the binary classification of the tasks. Personalized models in the uncontrolled environment achieved a mean classification accuracy of up to 83% while using one of the four labels, unveiling some unintentional mislabeling by participants. In controlled environments, multi-modality improved the accuracy by at least 7%. Generalized machine learning models achieved close to chance-level performance. This work underlines the importance of multi-modal recordings and provides the research community with an experimental paradigm to take studies of mental workload and stress out of controlled and into uncontrolled environments.

  4. Dataset of authors, books and publication dates of book series where authors...

    • workwithdata.com
    Updated Nov 25, 2024
    + more versions
    Cite
    Work With Data (2024). Dataset of authors, books and publication dates of book series where authors equals Eda Kranakis [Dataset]. https://www.workwithdata.com/datasets/book-series?col=book_series%2Cj0-author%2Cj0-book%2Cj0-publication_date&f=1&fcol0=j0-author&fop0=%3D&fval0=Eda+Kranakis&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 25, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book series. It has 1 row and is filtered where the authors is Eda Kranakis. It features 4 columns: book series, authors, books, and publication dates.

  5. PIA Customer Feedback Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). PIA Customer Feedback Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/1a069a47-d689-40dd-af73-4410a79ebbb4
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset provides customer reviews for PIA Experience, gathered through web scraping from airlinequality.com. It is specifically designed for data science and analytics applications, offering valuable insights into customer sentiment and feedback. The data is suitable for various analytical tasks, including modelling, predictive analysis, feature engineering, and exploratory data analysis (EDA). Users should note that the data requires an initial cleaning phase due to the presence of null values.

    Columns

    • reviews: Contains individual customer feedback entries pertaining to their experience with PIA. This column features approximately 160 distinct review entries.

    Distribution

    The dataset is provided as a CSV file. While the 'reviews' column contains 160 unique values, the exact total number of rows or records in the dataset is not explicitly detailed. It is structured in a tabular format, making it straightforward for data processing.

    Usage

    This dataset is ideally suited for a variety of applications, including:

    • Modelling
    • Predictive analysis
    • Feature engineering
    • Exploratory Data Analysis (EDA)
    • Natural Language Processing (NLP) tasks, such as sentiment analysis or topic modelling.
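
    A minimal sketch of the cleaning step noted in the description followed by a simple sentiment score over the 'reviews' column (the file name is an assumption; NLTK's VADER analyzer stands in for whichever sentiment method you prefer):

    import nltk
    import pandas as pd
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")  # one-time lexicon download

    df = pd.read_csv("pia_customer_feedback.csv")  # assumed file name
    # Initial cleaning phase: the description notes null values are present.
    df = df.dropna(subset=["reviews"]).drop_duplicates(subset=["reviews"])

    sia = SentimentIntensityAnalyzer()
    df["sentiment"] = df["reviews"].apply(
        lambda r: sia.polarity_scores(r)["compound"])
    print(df["sentiment"].describe())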

    Coverage

    The dataset's focus is primarily on customer reviews from the Asia region. It was listed on 17 June 2025, and the content relates specifically to the experiences of customers using PIA.

    License

    CC0

    Who Can Use It

    This dataset is beneficial for a range of users, including:

    • Data scientists looking to develop predictive models or perform advanced feature engineering.
    • Data analysts interested in conducting exploratory data analysis to uncover trends and patterns.
    • Researchers studying customer satisfaction, service quality, or airline industry performance.
    • Developers working on natural language processing solutions, particularly those focused on text analytics from customer feedback.

    Dataset Name Suggestions

    • PIA Customer Feedback
    • PIA Experience Reviews
    • Airline Customer Sentiment - PIA
    • PIA Passenger Reviews
    • PIA Service Review Data

    Attributes

    Original Data Source: PIA Customer Reviews

  6. Employee Turnover Analytics Dataset

    • kaggle.com
    Updated Jun 8, 2023
    Cite
    Akshay Hedau (2023). Employee Turnover Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/akshayhedau/employee-turnover-analytics-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akshay Hedau
    Description

    Portobello Tech is an app innovator that has devised an intelligent way of predicting employee turnover within the company. It periodically evaluates employees' work details, including the number of projects they worked on, average monthly working hours, time spent in the company, promotions in the last 5 years, and salary level. Data from prior evaluations show the employees' satisfaction at the workplace. The data can be used to identify patterns in work style and employees' interest in continuing to work at the company. The HR Department owns the data and uses it to predict employee turnover. Employee turnover refers to the total number of workers who leave a company over a certain time period.
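
    A hedged sketch of a turnover baseline built on the evaluation fields named above; the exact column names (and the target column "left") are assumptions about the file, not confirmed by the source:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("employee_turnover.csv")  # assumed file name

    features = ["satisfaction_level", "number_project",        # assumed names
                "average_monthly_hours", "time_spend_company",
                "promotion_last_5years"]
    X = df[features]
    y = df["left"]  # assumed target: 1 if the employee left the company

    model = LogisticRegression(max_iter=1000)
    print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())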

  7. Detailed Analysis on campus recruitment

    • kaggle.com
    Updated Oct 25, 2020
    Cite
    BANDI SAMUEL 2039426 (2020). Detailed Analysis on campus recruitment [Dataset]. https://www.kaggle.com/bandisamuel2039426/detailed-analysis-on-campus-recruitment/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 25, 2020
    Dataset provided by
    Kaggle
    Authors
    BANDI SAMUEL 2039426
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This data set consists of placement data for students on an XYZ campus. It includes secondary and higher secondary school percentage and specialisation, as well as degree specialisation, degree type, work experience, and salary offers to the placed students. We will analyse which factors play a major role in selecting a candidate for job recruitment.

  8. Data Science Career Opportunities (USA)

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Data Science Career Opportunities (USA) [Dataset]. https://www.opendatabay.com/data/ai-ml/6d1c5965-8fb2-4749-a8bd-f1c40861b401
    Explore at:
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    United States, Data Science and Analytics
    Description

    This dataset provides valuable insights into the US data science job market, containing detailed job listings scraped from the Indeed web portal on 20th November 2022. It is ideal for those seeking to understand job trends, analyse salary expectations, or develop skills in data analysis, machine learning, and natural language processing. The dataset's purpose is to offer a snapshot of available positions across various data science roles, including data scientists, machine learning engineers, and business analysts. It serves as a rich resource for exploratory data analysis, feature engineering, and predictive modelling tasks.

    Columns

    • Title: The job title of the listed position.
    • Company: The hiring company posting the job.
    • Location: The geographic location of the job within the US.
    • Rating: The rating associated with the job or company.
    • Date: Indicates how long the job had been posted prior to 20th November 2022.
    • Salary: The salary information provided in US Dollars ($). Please note that many entries in this column may be missing as salary details are often not disclosed in job listings.
    • Description: A brief summary description of the job.
    • Links: The direct link to the original job posting on the Indeed platform.
    • Descriptions: The full-length description of the job, encompassing all details found in the complete job posting.

    Distribution

    This dataset is provided as a single data file, typically in CSV format. It comprises 1200 rows (records) and 9 distinct columns. The file name is data_science_jobs_indeed_us.csv.

    Usage

    This dataset is perfectly suited for a variety of analytical tasks and applications:

    • Data Cleaning and Preparation: Practise handling missing values, especially in the 'Salary' column.
    • Exploratory Data Analysis (EDA): Discover trends in job titles, company types, and locations.
    • Feature Engineering: Extract new features from the 'Descriptions' column, such as required skills, education levels, or experience.
    • Classification and Clustering: Develop models for salary prediction, or perform skill clustering analysis to guide curriculum development.
    • Text Processing and Natural Language Processing (NLP): Analyse job descriptions to identify common skill demands or industry buzzwords.
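
    A minimal sketch of the 'Salary' cleaning and 'Descriptions' feature-engineering tasks listed above; the file and column names come from this listing, while the skill keywords are illustrative only:

    import pandas as pd

    df = pd.read_csv("data_science_jobs_indeed_us.csv")

    # Data cleaning: quantify the missing salary entries first.
    print("Share of missing salaries:", df["Salary"].isna().mean())

    # Feature engineering: flag skills mentioned in the full description.
    for skill in ["python", "sql", "machine learning", "aws"]:
        col = "skill_" + skill.replace(" ", "_")
        df[col] = df["Descriptions"].str.contains(skill, case=False, na=False)

    print(df[[c for c in df.columns if c.startswith("skill_")]].mean())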

    Coverage

    The dataset's geographic scope is limited to job postings within the United States. All data was collected on 20th November 2022, with the 'Date' column providing information on how long each job had been active before this date. The dataset covers a wide range of data science positions, including roles such as data scientist, machine learning engineer, data engineer, business analyst, and data science manager. It is important to note the presence of many missing entries in the 'Salary' column, reflecting common data availability challenges in job listings.

    License

    CC0

    Who Can Use It

    This dataset is an excellent resource for:

    • Aspiring Data Scientists and Machine Learning Engineers: To sharpen their data cleaning, EDA, and model deployment skills.
    • Educators and Curriculum Developers: To inform and guide the development of relevant data science and analytics courses based on real-world job market demands.
    • Job Seekers: To understand the current landscape of data science roles, required skills, and potential salary ranges.
    • Researchers and Analysts: To glean insights into labour market trends in the data science domain.
    • Human Resources Professionals: To benchmark job roles, skill requirements, and compensation within the industry.

    Dataset Name Suggestions

    • Indeed US Data Science Job Insights
    • US Data Science Job Market Analysis
    • Data Professional Job Postings (Indeed USA)
    • Data Science Career Opportunities (USA)

    Attributes

    Original Data Source: Data Science Job Postings (Indeed USA)

  9. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    Available download formats: csv, text/markdown, json, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.
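
    A minimal sketch, using the libraries above, of joining the Train and Store files on the fields named in the data-field description (local CSV file names are an assumption; the source serves the data via the DBRepo API):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    train = pd.read_csv("train.csv", parse_dates=["Date"])  # assumed local copies
    store = pd.read_csv("store.csv")

    df = train.merge(store, on="Store", how="left")
    df = df[df["Open"] == 1]  # closed days contribute no sales

    features = ["Store", "Promo", "SchoolHoliday", "CompetitionDistance"]
    X = df[features].fillna(0)
    y = df["Sales"]

    model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
    model.fit(X, y)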

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.

  10. A Dataset on Unobtrusive Measurement of Cognitive Load and Physiological...

    • zenodo.org
    zip
    Updated Jul 29, 2024
    Cite
    Christoph Anders; Sidratul Moontaha; Samik Real; Bert Arnrich (2024). A Dataset on Unobtrusive Measurement of Cognitive Load and Physiological Signals (EEG, PPG, EDA) in Uncontrolled Environments [Dataset]. http://doi.org/10.5281/zenodo.10371068
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    Christoph Anders, Sidratul Moontaha, Samik Real, and Bert Arnrich
    Authors
    Christoph Anders; Sidratul Moontaha; Samik Real; Bert Arnrich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset (approximately 315 hours in total) consists of physiological signals from wearable electroencephalography (EEG), electrodermal activity (EDA), photoplethysmogram (PPG), acceleration, and temperature sensors. The recorded dataset is curated from 24 participants following an eight-hour cognitive load elicitation paradigm. The mentioned consumer-grade physiological signals are obtained from the Muse S EEG headband and Empatica E4 wristband. The data is balanced across controlled and uncontrolled environments and high vs. low mental workload levels. During the study, participants worked on mental arithmetic, Stroop, N-Back, and Sudoku tasks in the controlled environment (roughly half of the data) and realistic home-office tasks such as researching, programming, and writing emails in uncontrolled environments. Data labels were obtained using Likert scales, Affective Sliders, PANAS, and NASA-TLX questionnaires. The completely anonymized data set and its publicly available features open vast potential for the research community working on mental workload detection using consumer-grade wearable sensors. Among others, the data is suitable for developing real-time cognitive load detection methods, research on signal processing techniques for challenging environments, developing artifact removal techniques from low-cost wearable devices' data, or developing personal mental workload assistants.

    The link to the publication of 'Unobtrusive measurement of cognitive load and physiological signals in uncontrolled environments' will be added once the data descriptor is accepted in the respective journal.

  11. Kaggle DS Survey 2019

    • kaggle.com
    Updated Dec 1, 2019
    Cite
    Alan Asri (2019). Kaggle DS Survey 2019 [Dataset]. https://www.kaggle.com/datasets/alanasri/kaggle-ds-survey-2019
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 1, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alan Asri
    Description

    Context

    This notebook contains a thorough analysis and explanation of the survey conducted by Kaggle. The survey covered respondents of varied work backgrounds and ages, where they lived, and the companies where they worked. The survey questions concern the fields respondents work in, relating to Data Science and Machine Learning.

    Content

    The following exploratory data analysis takes data from the survey conducted by Kaggle in 2019 on respondents who answered questions about Machine Learning and Data Science. The core points of this analysis are as follows:

    1. Graph of age distribution with formal education
    2. Plot of company size and money spent on Machine Learning
    3. Comparison of Machine Learning spending levels by each company
    4. Data Scientist experience and their compensation
    5. Correlation between Machine Learning experience and salary benefit
    6. Correlation of Data Scientist role with compensation
    7. Favourite media sources on Data Science topics
    8. Favourite media by age distribution; media most liked by Data Scientists
    9. Course platforms for Data Scientists
    10. Job role for each title; primary job of a Data Scientist
    11. Programming languages used on a regular basis by job title, especially Data Scientists
    12. Comparison of specific programming ability and compensation
    13. Which programming language aspiring Data Scientists learn first
    14. Integrated Development Environments used on a regular basis
    15. Top 5 IDEs and which countries use them; Microsoft not dominant in the USA
    16. Which notebooks are most used on a regular basis; Google dominates
    17. Which countries and companies use which hardware for Machine Learning
    18. Job roles by specific company type
    19. Computer Vision methods mostly used by companies
    20. Distribution of companies by country
    21. Cloud products: Amazon dominates, Google follows
    22. Big Data products: Amazon the majority in Enterprise, Google the majority overall

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  12. Data from: Assessing predictive performance of supervised machine learning...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated May 23, 2023
    Cite
    Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Strathmore University
    Authors
    Evans Omondi
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work, including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metric values and analysis, eXtreme Gradient Boosting was found to be the most optimal algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

    Methods

    Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
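
    A hedged sketch of the recommended regressor, assuming the classic Kaggle diamonds schema (carat, cut, color, clarity, depth, table, price); this illustrates the setup rather than reproducing the study's exact pipeline:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    df = pd.read_csv("diamonds.csv")  # assumed file name
    X = pd.get_dummies(df.drop(columns=["price"]),
                       columns=["cut", "color", "clarity"])
    y = df["price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    model = XGBRegressor(n_estimators=300, learning_rate=0.1)
    model.fit(X_train, y_train)
    print("R^2:", model.score(X_test, y_test))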

  13. Biometrics for stress monitoring

    • kaggle.com
    zip
    Updated May 12, 2019
    Cite
    qiriro (2019). Biometrics for stress monitoring [Dataset]. https://www.kaggle.com/qiriro/stress
    Explore at:
    Available download formats: zip (5623939830 bytes)
    Dataset updated
    May 12, 2019
    Authors
    qiriro
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset organization

    This dataset comprises heart rate variability (HRV) and electrodermal activity (EDA) features (for more details on how the features were computed, please refer to our paper) computed from the SWELL and WESAD datasets. It comprises the following three directories and sub-directories:

    1. interim — contains the intermediate data transformed directly from the raw datasets. The raw datasets are not included, but they can be obtained from their respective publishers. This folder contains the following major contents:
       • Labels — the ground truth of the experiments. For details of how these were obtained, please refer to the papers included in the root of this directory.
       • eda — the raw EDA signals
       • rri — the inter-beat (RR) intervals extracted from the electrocardiogram (ECG) signal
    2. processed — contains files computed from those in the interim directory to facilitate the analysis
    3. final — files that were used for creating the model. This directory contains two sub-directories:
       • datasets — contains the combined train, test and validation data used to create the model. For more details refer to section II of the paper ”The influence of person-specific biometrics in improving generic stress predictive models”
       • results — contains the detailed results published in ”The influence of person-specific biometrics in improving generic stress predictive models”

      References

      Nkurikiyeyezu, K., Yokokubo, A., & Lopez, G. (2020). The Effect of Person-Specific Biometrics in Improving Generic Stress Predictive Models. Journal of Sensors & Material, 1–12. http://arxiv.org/abs/1910.01770

      Acknowledgements

      All the credit goes to the original authors of the WESAD and SWELL datasets for collecting and freely sharing them. If you find this research helpful, please consider citing their papers:

      1. S. Koldijk, M. A. Neerincx, and W. Kraaij, “Detecting Work Stress in Offices by Combining Unobtrusive Sensors,” IEEE Trans. Affect. Comput., vol. 9, no. 2, pp. 227–239, 2018.
      2. S. Koldijk, M. Sappelli, S. Verberne, M. A. Neerincx, and W. Kraaij, “The SWELL Knowledge Work Dataset for Stress and User Modeling Research,” Proc. 16th Int. Conf. Multimodal Interact. - ICMI ’14, pp. 291–298, 2014.
      3. Kraaij, Prof.dr.ir. W. (Radboud University & TNO); Koldijk, MSc. S. (TNO & Radboud University); Sappelli, MSc M. (TNO & Radboud University) (2014): The SWELL Knowledge Work Dataset for Stress and User Modeling Research. DANS. https://doi.org/10.17026/dans-x55-69zp
      4. Schmidt, P., Reiss, A., Duerichen, R., Marberger, C., & Van Laerhoven, K. (2018). Introducing WESAD, a Multimodal Dataset for Wearable Stress and Affect Detection. Proceedings of the 2018 on International Conference on Multimodal Interaction - ICMI ’18, 400–408. https://doi.org/10.1145/3242969.3242985

  14. A Dataset for Engagement Prediction in Neuromotor Disorder Patients during...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 11, 2024
    Cite
    Storm, Fabio Alexander (2024). A Dataset for Engagement Prediction in Neuromotor Disorder Patients during Rehabilitation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10812449
    Explore at:
    Dataset updated
    Dec 11, 2024
    Dataset provided by
    Chiappini, Mattia
    Costantini, Simone
    Dei, Carla
    Ambrosini, Emilia
    Andreoni, Giuseppe
    Bellazzecca, Silvia
    Malerba, Giorgia
    Falivene, Anna
    Storm, Fabio Alexander
    Biffi, Emilia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset is related to the work titled "Artificial Intelligence Tools for Engagement Prediction in Neuromotor Disorder Patients undergoing Robot-Assisted Rehabilitation," whose aim is to methodologically explore the performance of artificial intelligence algorithms applied to structured datasets made of heart rate variability (HRV) and electrodermal activity (EDA) features to predict the level of patient engagement during robot-assisted gait rehabilitation (RAGR).

    It is composed of three Excel files related to the 3-minute windows data augmentation scenario applied to the bimodal dataset made of 14 HRV and 19 EDA features. Specifically:

    ds_bimodal_win_3min.xlsx contains the features extracted from 3-minute windows of HRV and EDA signals, recorded during the RAGR activity, and normalized with respect to the reference (baseline) signals. Features are not z-scored.

    labels_self_win_3min contains a single column with, for each row, the label for the self-perceived engagement classification target.

    labels_therapist_win_3min contains a single column with, for each row, the label for the therapist-perceived engagement classification target.

    The coding of classes for both classification targets is:

    0: Underchallenged

    1: Minimally Challenged

    2: Challenged
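
    A minimal sketch of pairing the feature file with one of the label files described above (file names are from the description; the '.xlsx' extensions on the label files and the one-row-per-window layout are assumptions, and reading Excel files requires openpyxl):

    import pandas as pd

    X = pd.read_excel("ds_bimodal_win_3min.xlsx")  # 14 HRV + 19 EDA features
    y = pd.read_excel("labels_therapist_win_3min.xlsx").squeeze("columns")

    # Classes: 0 = Underchallenged, 1 = Minimally Challenged, 2 = Challenged
    print(X.shape)
    print(y.value_counts())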

  15. ‘COVID-19 dataset in Japan’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19 dataset in Japan’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-dataset-in-japan-2665/latest
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Japan
    Description

    Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    1. Context

    This is a COVID-19 dataset in Japan. It does not include the cases on the Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) or the Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture). It covers:

    - Total number of cases in Japan
    - The number of vaccinated people (New/experimental)
    - The number of cases at prefecture level
    - Metadata of each prefecture

    Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data

    This dataset can be retrieved with CovsirPhy (Python library).

    pip install covsirphy --upgrade
    
    import covsirphy as cs
    data_loader = cs.DataLoader()
    japan_data = data_loader.japan()
    # The number of cases (Total/each province)
    clean_df = japan_data.cleaned()
    # Metadata
    meta_df = japan_data.meta()
    

    Please refer to CovsirPhy Documentation: Japan-specific dataset.

    Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.

    1.1 Total number of cases in Japan

    covid_jpn_total.csv

    Cumulative number of cases:
    - PCR-tested / PCR-tested and positive
    - with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020)
    - discharged
    - fatal

    The number of cases:
    - requiring hospitalization (from 09May2020)
    - hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020)
    - requiring hospitalization, but waiting in hotels or at home (to 08May2020)

    In the primary source, some variables were removed on 09May2020; their values are NA in this dataset from 09May2020 onward.

    Manually collected the data from Ministry of Health, Labour and Welfare HP:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    The number of vaccinated people:
    - Vaccinated_1st: the number of persons vaccinated for the first time on the date
    - Vaccinated_2nd: the number of persons vaccinated with the second dose on the date
    - Vaccinated_3rd: the number of persons vaccinated with the third dose on the date

    Data sources for vaccination:
    - To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績 (MHLW vaccination records, in Japanese); 首相官邸 新型コロナワクチンについて (Prime Minister's Office, on COVID-19 vaccines)
    - From 10Apr2021: Twitter: 首相官邸(新型コロナワクチン情報) (Prime Minister's Office COVID-19 vaccine information)

    1.2 The number of cases at prefecture level

    covid_jpn_prefecture.csv

    Cumulative number of cases:
    - PCR-tested / PCR-tested and positive
    - discharged
    - fatal

    The number of cases:
    - requiring hospitalization (from 09May2020)
    - hospitalized with severe symptoms (from 09May2020)

    Using a PDF-to-Excel converter, the data was collected manually from the Ministry of Health, Labour and Welfare HP:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total. When you analyse total data in Japan, please use covid_jpn_total data.

    1.3 Metadata of each prefecture

    covid_jpn_metadata.csv
    - Population (Total, Male, Female): 厚生労働省 厚生統計要覧(2017年度)第1-5表 (MHLW Statistical Abstract, FY2017, Tables 1-5)
    - Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015) (list of prefecture areas)

    2. Acknowledgements

    To create this dataset, edited and transformed data of the following sites was used.

    厚生労働省 Ministry of Health, Labour and Welfare, Japan:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English) 厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan: 国土交通省 HP (in Japanese) 国土交通省 HP (in English) 国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    Code for Japan / COVID-19 Japan: Code for Japan COVID-19 Japan Dashboard (CC BY 4.0) COVID-19 Japan 都道府県別 感染症病床数 (CC BY)

    Wikipedia: Wikipedia

    LinkData: LinkData (Public Domain)

    Inspiration

    1. Changes in number of cases over time
    2. Percentage of patients without symptoms / mild or severe symptoms
    3. What to do next to prevent outbreak

    License and how to cite

    Kindly cite this dataset under the CC BY-4.0 license as follows:
    - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or
    - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan

    --- Original source retains full ownership of the source dataset ---

  16. ERA5 Reanalysis Monthly Means

    • rda.ucar.edu
    • data.ucar.edu
    Updated Oct 6, 2017
    + more versions
    Cite
    European Centre for Medium-Range Weather Forecasts (2017). ERA5 Reanalysis Monthly Means [Dataset]. http://doi.org/10.5065/D63B5XW1
    Explore at:
    Dataset updated
    Oct 6, 2017
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    European Centre for Medium-Range Weather Forecasts
    Time period covered
    Jan 1, 2008 - Dec 31, 2017
    Area covered
    Description

    Please note: Please use ds633.1 to access RDA maintained ERA-5 Monthly Mean data, see ERA5 Reanalysis (Monthly Mean 0.25 Degree Latitude-Longitude Grid), RDA dataset ds633.1. This dataset is no longer being updated, and web access has been removed.

    After many years of research and technical preparation, the production of a new ECMWF climate reanalysis to replace ERA-Interim is in progress. ERA5 is the fifth generation of ECMWF atmospheric reanalyses of the global climate, which started with the FGGE reanalyses produced in the 1980s, followed by ERA-15, ERA-40 and most recently ERA-Interim. ERA5 will cover the period January 1950 to near real time, though the first segment of data to be released will span the period 2010-2016.

    ERA5 is produced using high-resolution forecasts (HRES) at 31 kilometer resolution (one fourth the spatial resolution of the operational model) and a 62 kilometer resolution ten member 4D-Var ensemble of data assimilation (EDA) in CY41r2 of ECMWF's Integrated Forecast System (IFS) with 137 hybrid sigma-pressure (model) levels in the vertical, up to a top level of 0.01 hPa. Atmospheric data on these levels are interpolated to 37 pressure levels (the same levels as in ERA-Interim). Surface or single level data are also available, containing 2D parameters such as precipitation, 2 meter temperature, top of atmosphere radiation and vertical integrals over the entire atmosphere. The IFS is coupled to a soil model, the parameters of which are also designated as surface parameters, and an ocean wave model. Generally, the data is available at an hourly frequency and consists of analyses and short (18 hour) forecasts, initialized twice daily from analyses at 06 and 18 UTC. Most analyses parameters are also available from the forecasts. There are a number of forecast parameters, e.g. mean rates and accumulations, that are not available from the analyses. Together, the hourly analysis and twice daily forecast parameters form the basis of the monthly means (and monthly diurnal means) found in this dataset.

    Improvements to ERA5, compared to ERA-Interim, include use of HadISST.2, reprocessed ECMWF climate data records (CDR), and implementation of RTTOV11 radiative transfer. Variational bias corrections have not only been applied to satellite radiances, but also ozone retrievals, aircraft observations, surface pressure, and radiosonde profiles.

    NCAR's Data Support Section (DSS) is performing and supplying a grid transformed version of ERA5, in which variables originally represented as spectral coefficients or archived on a reduced Gaussian grid are transformed to a regular 1280 longitude by 640 latitude N320 Gaussian grid. In addition, DSS is also computing horizontal winds (u-component, v-component) from spectral vorticity and divergence where these are available. Finally, the data is reprocessed into single parameter time series.

    Please note: As of November 2017, DSS is also producing a CF 1.6 compliant netCDF-4/HDF5 version of ERA5 for CISL RDA at NCAR. The netCDF-4/HDF5 version is the de facto RDA ERA5 online data format. The GRIB1 data format is only available via NCAR's High Performance Storage System (HPSS). We encourage users to evaluate the netCDF-4/HDF5 version for their work, and to use the currently existing GRIB1 files as a reference and basis of comparison. To ease this transition, there is a one-to-one correspondence between the netCDF-4/HDF5 and GRIB1 files, with as much GRIB1 metadata as possible incorporated into the attributes of the netCDF-4/HDF5 counterpart.

  17. Store Sales - T.S Forecasting...Merged Dataset

    • kaggle.com
    Updated Dec 15, 2021
    Cite
    Shramana Bhattacharya (2021). Store Sales - T.S Forecasting...Merged Dataset [Dataset]. https://www.kaggle.com/shramanabhattacharya/store-sales-ts-forecastingmerged-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shramana Bhattacharya
    Description

    This dataset is a merged dataset created from the data provided in the competition "Store Sales - Time Series Forecasting". The other datasets that were provided there apart from train and test (for example holidays_events, oil, stores, etc.) could not be used in the final prediction. In my understanding, EDA of the merged dataset gives a clearer picture of the other factors that might also affect the final prediction of grocery sales. Therefore, I created this merged dataset and posted it here for further analysis.

    Data Description

    Data Field Information (this is a copy of the description as provided in the actual dataset)

    Train.csv
    - id: store id
    - date: date of the sale
    - store_nbr: identifies the store at which the products are sold.
    - family: identifies the type of product sold.
    - sales: gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
    - onpromotion: gives the total number of items in a product family that were being promoted at a store on a given date.
    - Store metadata, including city, state, type, and cluster. cluster is a grouping of similar stores.
    - Holidays and Events, with metadata. NOTE: Pay special attention to the transferred column. A holiday that is transferred officially falls on that calendar day but was moved to another date by the government. A transferred day is more like a normal day than a holiday. To find the day that it was celebrated, look for the corresponding row where the type is Transfer. For example, the holiday Independencia de Guayaquil was transferred from 2012-10-09 to 2012-10-12, which means it was celebrated on 2012-10-12. Days that are type Bridge are extra days that are added to a holiday (e.g., to extend the break across a long weekend). These are frequently made up by the type Work Day, which is a day not normally scheduled for work (e.g., Saturday) that is meant to pay back the Bridge. Additional holidays are days added to a regular calendar holiday, for example, as typically happens around Christmas (making Christmas Eve a holiday).
    - dcoilwtico: Daily oil price. Includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

    Note: There is a transactions column in the training dataset which displays the sales transactions on that particular date.

    Test.csv
    - The test data, having the same features as the training data. You will predict the target sales for the dates in this file.
    - The dates in the test data are for the 15 days after the last date in the training data.

    Note: There is no transactions column in the test dataset as there was in the training dataset. Therefore, while building the model, you might exclude this column and use it only for EDA; a short sketch follows below.

    submission.csv - A sample submission file in the correct format.
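
    A short sketch of the note above: keep the transactions column for EDA only and exclude it from the modelling features, since the test data lacks it (the merged file name and the exact 'transactions' column name are assumptions):

    import pandas as pd

    train = pd.read_csv("merged_train.csv", parse_dates=["date"])  # assumed name

    # EDA-only use of the transactions column.
    print(train[["transactions", "sales"]].corr())

    # Modelling features exclude it; the test data has no such column.
    X = train.drop(columns=["transactions", "sales"])
    y = train["sales"]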

  18. ‘Grocery Store Prices, Mongolia’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Dec 3, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Grocery Store Prices, Mongolia’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-grocery-store-prices-mongolia-c5ff/497a27a8/?iid=007-673&v=presentation
    Explore at:
    Dataset updated
    Dec 3, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Mongolia
    Description

    Analysis of ‘Grocery Store Prices, Mongolia’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/robertritz/ub-market-prices on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    The National Statistics Office of Mongolia goes to each major market to record food prices each week in Ulaanbaatar, the capital city of Mongolia. The main purpose for this is to monitor a common basket of goods for use in consumer price index (CPI) calculations.

    Content

    The data is in long form, with date, market, product, and price recorded. All prices are in Mongolian Tugriks; as of 2021, the exchange rate is about 2850 MNT = 1 USD.

    Acknowledgements

    This dataset is possible thanks to the hard work of the people of the National Statistics Office of Mongolia.

    Inspiration

    Often people choose supermarkets over the open markets (called a "zakh"). Mostly this is for convenience, but it is notable how much money people could save by choosing a different market!

    This would be a great dataset for EDA or looking at how prices change over time.

    --- Original source retains full ownership of the source dataset ---
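
    A minimal sketch of the price-over-time EDA suggested above, using the long-form columns named in the description (date, market, product, price); the file name is an assumption:

    import pandas as pd

    df = pd.read_csv("ub_market_prices.csv", parse_dates=["date"])  # assumed name

    # Average price of each product per market: where could shoppers save?
    by_market = df.groupby(["product", "market"])["price"].mean().unstack()
    print(by_market.head())

    # Price trend over time, averaged per product (prices in Mongolian Tugriks).
    trend = df.groupby(["date", "product"])["price"].mean().unstack()
    print(trend.tail())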

  19. Map Flint - 2014 EDA by bg ACS5YR Gross Rent

    • mapflint-umich.opendata.arcgis.com
    Updated Aug 15, 2018
    + more versions
    Cite
    University of Michigan (2018). Map Flint - 2014 EDA by bg ACS5YR Gross Rent [Dataset]. https://mapflint-umich.opendata.arcgis.com/datasets/71c404dbb35b4430a612a4c8c6e05357
    Explore at:
    Dataset updated
    Aug 15, 2018
    Dataset authored and provided by
    University of Michigan
    Area covered
    Description

    Map Flint - Feature Service layer(s): ACS5YR 2010-2014 estimates for UM-Flint U.S. EDA Region (MEDC Region 6), Michigan, USA by block group of Gross Rent. Data Dictionary: https://mapflint.org/dictionaries/2014_EDA_by_bg_ACS5YR_Gross_Rent_vars024_data_dictionary.pdf

    Note: Layer(s) are not initially visible and must be turned on. This feature layer is an American Community Survey (ACS) estimate (U.S. Census Bureau) that is derived from the National Historical Geographic Information System (NHGIS) and has been customized for various Map Flint analyses and projects pertaining to the City of Flint, Genesee County, Michigan, U.S.A., and other surrounding counties - e.g., counties and communities in the greater Flint vicinity that also overlap with the mission of the University of Michigan-Flint EDA University Center for Community and Economic Development. All NHGIS layers in Map Flint projects maintain the uniquely-valued GISJOIN geographic ID assigned by the NHGIS in order to work with multiple data sets. For more information, visit https://mapflint.org

  20. Measuring and Quantifying Mental Workload and Stress in Everyday Situations,...

    • zenodo.org
    zip
    Updated Jun 17, 2025
    Cite
    Christoph Anders; Ipsita Bhaduri; Bert Arnrich (2025). Measuring and Quantifying Mental Workload and Stress in Everyday Situations, focusing on typical office activities [Dataset]. http://doi.org/10.5281/zenodo.15681263
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Anders; Ipsita Bhaduri; Bert Arnrich
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2023 - 2024
    Description

    Dataset Description

    The dataset (approximately 80 hours in total) consists of physiological signals from wearable electroencephalography (EEG), electrodermal activity (EDA), photoplethysmogram (PPG), acceleration, and temperature sensors. The dataset was recorded from 10 participants who performed broadly pre-defined relaxation, reading, summarizing, and mental workload tasks. The consumer-grade physiological signals were obtained from the Muse S EEG headband and Empatica E4 wristband. The data is balanced across controlled and uncontrolled environments. During the study, participants worked on Stroop, N-Back, reading, summarizing, and relaxation tasks in the controlled environment (roughly half of the data) and realistic home-office tasks such as reading, summarizing, and relaxing in uncontrolled environments. Data labels were obtained using Likert scales and NASA-TLX questionnaires. The completely anonymized data set is publicly available and opens vast potential for the research community working on mental workload detection using consumer-grade wearable sensors. Among others, the data is suitable for developing real-time cognitive load detection methods, research on signal processing techniques for challenging environments, developing artifact removal techniques from low-cost wearable devices' data, or developing personal mental workload assistants for scenarios such as scheduling just-in-time work-break recommendations.

    As literature for the reading and summarizing tasks, six scientific publications (Anagnos and Kiremidjian (1998); Nunes-Halldorson and Duran (2003); Mansouri et al. (2011); Kwak et al. (2018); Zhao et al. (2018); Fernbach et al. (2019)) were chosen as difficult texts, and six short stories by famous English-language writers (O. Henry (’The Gift of the Magi’), Edgar Allan Poe (’The Masque of the Red Death’, ’The Cask of Amontillado’, and ’The Black Cat’), Oscar Wilde (’The Devoted Friend’), and Charlotte Brontë (’The Search After Happiness’)) were chosen as easy texts.

    The link to the publication will be added once the manuscript is accepted in the respective journal.

    Technical Info

    The anonymized data is located in the subfolder 'dataset', in which the subfolders 'Participant 01' to 'Participant 10' hold data from individual participants. For each participant, three subfolders exist: 'Lab 1' and 'Lab 2' for the data recorded in the controlled environment, and 'In-the-wild' for the data recorded in uncontrolled environments. In the 'In-the-wild'-subfolder, numerated folders exist which contain the data for the respective recording, and a file called 'P#participant_wild_labels.csv' (e.g., 'P01_wild_labels.csv') contains the respective labels. Per recording (i.e., under 'Lab 1' and 'Lab 2' as well as '1' ... 'N' for the 'In-the-wild'-subfolder), the following files exist for the data recorded from the Empatica E4 ('ACC.csv', 'BVP.csv', 'EDA.csv', 'HR.csv', 'IBI.csv', 'info.txt', 'tags.csv', 'TEMP.csv') and for the data recorded from the Muse S ('P#participant_#recording_muse.csv', e.g. 'P2_wild1_muse.csv'). For the data recorded in the controlled environment (i.e., in 'Lab 1' and 'Lab 2'), two more files exist: '*_papers_*date.csv', and 'psychopy_log.log', each holding the experimental data recorded during the computerized mental workload tasks.
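
    A minimal sketch of loading one Empatica E4 signal from the layout above. E4 CSV exports conventionally store the recording start timestamp in the first row and the sampling rate in the second; treat that as an assumption here and verify against the 'info.txt' in each recording folder:

    import pandas as pd

    path = "dataset/Participant 01/Lab 1/EDA.csv"
    raw = pd.read_csv(path, header=None)

    start_ts = float(raw.iloc[0, 0])  # Unix start time in seconds (assumed layout)
    fs = float(raw.iloc[1, 0])        # sampling rate in Hz (assumed layout)
    eda = raw.iloc[2:, 0].astype(float).reset_index(drop=True)

    # Attach a wall-clock time index for alignment with the Muse S data.
    eda.index = pd.to_datetime(start_ts + eda.index / fs, unit="s")
    print(eda.describe())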

    Source Code

    Source Code is uploaded at: https://github.com/HPI-CH/mw_office_2025. Apart from the anonymized data, the source code to load, process, and extract features, as well as to perform statistical analyses and machine learning can be found in the Python script 'towards_general_cognitive_load_assistants_ML.py'. The Python script 'psychopy_csv_log_parser.py' is a helper script to analyze the log files generated during the recordings in the controlled environment. The Python script 'towards_general_cognitive_load_assistants_ML.py' is parameterized, and all the experiments reported in the paper can be reproduced using the Bash script 'Towards_General_Cognitive_Load_Assistants_ML.sh' provided. To run the source code, it is recommended to set up a virtual environment with the required libraries. The anaconda .yml-file 'anaconda_environment.yml' contains the required information about Python libraries to setup the anaconda environment 'neuroinf', which is activated automatically in the Bash script 'Towards_General_Cognitive_Load_Assistants_ML.sh' provided. Furthermore, in case you choose to reproduce and replicate the results using the source code provided, the empty folders 'ml_results' and 'stats_results' exist in which the respective results would automatically be stored.

    Contact

    Finally, please feel free to reach out should you encounter any issues or have any open questions regarding this data set, the source code, or the publication. You can reach the authors via the contact information provided in the publication or via email to 'christoph.anders@hpi.de', 'christoph.anders@hpi.uni-potsdam.de', 'office-arnrich@hpi.uni-potsdam.de', or 'mw_office_2025@hpi.de'.
