https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 10,000 records of corporate employees across various departments, focusing on work hours, job satisfaction, and productivity performance. The dataset is designed for exploratory data analysis (EDA), performance benchmarking, and predictive modeling of productivity trends.
You can conduct EDA and investigate correlations between work hours, remote work, job satisfaction, and productivity, or create new metrics such as efficiency per hour or the impact of meetings on productivity. Machine Learning Model: for a predictive task, you can use "Productivity_Score" as a regression target (predicting continuous performance scores), or frame a classification problem (e.g., categorize employees into high, medium, or low productivity).
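As a quick illustration of the workflow above, the sketch below derives an efficiency-per-hour metric and bins "Productivity_Score" into three classes. The data and the column names other than "Productivity_Score" are synthetic assumptions for illustration, not the actual files.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the employee records; only "Productivity_Score"
# is taken from the dataset description, the rest are assumed names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Work_Hours_Per_Week": rng.integers(30, 60, 300),
    "Meetings_Per_Week": rng.integers(0, 15, 300),
    "Productivity_Score": rng.uniform(0, 100, 300),
})

# Derived metric: productivity per hour worked
df["Efficiency_Per_Hour"] = df["Productivity_Score"] / df["Work_Hours_Per_Week"]

# Turn the regression target into a 3-class label for classification
df["Productivity_Class"] = pd.qcut(
    df["Productivity_Score"], q=3, labels=["low", "medium", "high"]
)
print(df["Productivity_Class"].value_counts())
```

`pd.qcut` gives roughly equal-sized classes, which keeps the derived classification problem balanced.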
This dataset was created by Mohinur Abdurahimova
Released under: Data files © Original Authors
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, it does not contain any missing values and data was standardized across features. The small number of samples prevented a full and strong statistical analysis of the results. Nevertheless, it allowed the identification of relevant hidden patterns and trends.
Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using the Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments were performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
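For readers without Orange, a rough scikit-learn analogue of the quoted decision-tree configuration might look like the following. Two substitutions are deliberate: scikit-learn offers entropy but not gain ratio, and it has no "stop splitting when majority reaches 95%" option, so this is only an approximation, run here on synthetic stand-in data with the study's shape (36 samples, 11 features, 9 classes).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data mirroring the study's dimensions: 36 samples,
# 11 numeric features, 9 classes (4 samples per class on average).
X, y = make_classification(n_samples=36, n_features=11, n_informative=6,
                           n_classes=9, n_clusters_per_class=1,
                           random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",   # stand-in for gain ratio (not available in sklearn)
    min_samples_leaf=2,    # minimum number of samples in leaves
    min_samples_split=5,   # minimum samples required to split a node
    random_state=0,
)

# Stratified cross-validation, as in the study
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```

With only 4 samples per class, the number of folds must stay small; the study's precision, recall, F-measure, and AUC can be obtained the same way by changing the `scoring` argument.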
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Eda_all is a dataset for instance segmentation tasks - it contains "All" class annotations for 1,314 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
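Two of the simplest Easy Data Augmentation (EDA) operations mentioned above, random swap and random deletion, can be sketched in a few lines of plain Python; synonym replacement is omitted here because it requires an external lexical resource. The sample sentence is illustrative.

```python
import random

def random_swap(words, n=1):
    """Swap the positions of two words n times (an EDA operation)."""
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

random.seed(42)
sentence = "las letras de canciones expresan emociones".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_deletion(sentence)))
```

Each call produces a slightly perturbed copy of the input, which is how small labeled datasets are artificially expanded before training.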
https://creativecommons.org/publicdomain/zero/1.0/
This dataset simulates employee-level data for burnout prediction and classification tasks. It can be used for binary classification, exploratory data analysis (EDA), and feature importance exploration.
📄 Columns
Name — Synthetic employee name (for realism, not for ML use).
Age — Age of the employee.
Gender — Male or Female.
JobRole — Job type (Engineer, HR, Manager, etc.).
Experience — Years of work experience.
WorkHoursPerWeek — Average number of working hours per week.
RemoteRatio — % of time spent working remotely (0–100).
SatisfactionLevel — Self-reported satisfaction (1.0 to 5.0).
StressLevel — Self-reported stress level (1 to 10).
Burnout — Target variable. 1 if signs of burnout exist (high stress + low satisfaction + long hours), otherwise 0.
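The stated labeling rule (high stress + low satisfaction + long hours) can be sketched as below. The cut-off values are illustrative assumptions, since the dataset's actual generation thresholds are not published in the description.

```python
def burnout_label(stress_level, satisfaction, hours_per_week,
                  stress_cut=7, satisfaction_cut=2.5, hours_cut=50):
    """Return 1 when the burnout heuristic described above fires.

    The cut-off values are assumptions for illustration; only the
    rule's shape (high stress + low satisfaction + long hours) comes
    from the dataset description.
    """
    at_risk = (stress_level >= stress_cut
               and satisfaction <= satisfaction_cut
               and hours_per_week >= hours_cut)
    return 1 if at_risk else 0

print(burnout_label(9, 1.8, 60))  # high stress, low satisfaction, long hours
print(burnout_label(3, 4.5, 38))  # none of the risk factors
```

A rule-based target like this is also a useful sanity check: a classifier trained on the data should recover something close to it via feature importances.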
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Solar Panel EDA is a dataset for object detection tasks - it contains Solar Panel annotations for 721 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Portobello Tech is an app innovator that has devised an intelligent way of predicting employee turnover within the company. It periodically evaluates employees' work details, including the number of projects they worked on, average monthly working hours, time spent at the company, promotions in the last 5 years, and salary level. Data from prior evaluations show each employee's satisfaction at the workplace. The data can be used to identify patterns in work style and employees' interest in continuing to work at the company. The HR Department owns the data and uses it to predict employee turnover, i.e., the total number of workers who leave a company over a certain time period.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered where the author is Eda Kranakis. It features 4 columns, including authors, books, and publication dates.
https://spdx.org/licenses/CC0-1.0.html
Diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, this study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated: Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis comprised data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics, eXtreme Gradient Boosting was the most optimal algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.
Methods
Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets, study them in a web-based data-science environment, and construct models.
An audit with findings and recommendations for improvements to the management of EDA's Revolving Loan Fund program. This program provides grants to state and local governments, political subdivisions, and nonprofit organizations to operate lending programs for businesses that cannot get traditional bank financing.
Description: This dataset was created to serve as an easy-to-use image dataset, perfect for experimenting with object detection algorithms. The main goal was to provide a simplified dataset that allows for quick setup and minimal effort in exploratory data analysis (EDA). This dataset is ideal for users who want to test and compare object detection models without spending too much time navigating complex data structures. Unlike datasets like chest x-rays, which… See the full description on the dataset page: https://huggingface.co/datasets/gtsaidata/V2-Balloon-Detection-Dataset.
🎵 Music Feature Dataset Analysis
This repository contains a comprehensive exploratory data analysis (EDA) on a music features dataset. The primary objective is to understand the patterns in audio features and analyze how they relate to user preferences, providing insights for music recommendation systems and user profiling.
📥 Dataset Overview
The dataset (data.csv) contains audio features extracted from music tracks along with user preference scores. This rich… See the full description on the dataset page: https://huggingface.co/datasets/JigneshPrajapati18/model_dataset.
https://cubig.ai/store/terms-of-servicehttps://cubig.ai/store/terms-of-service
1) Data Introduction • The Superheros_abilities_dataset contains the abilities and attributes of 200 superheroes and villains from the Marvel and DC universes, organized into 10 columns, including each character's name, moral orientation, strength, speed, intelligence, combat power, major weapons/capabilities, overall power, and popularity.
2) Data Utilization (1) Superheros_abilities_dataset has characteristics that: • It is a small, refined dataset that reflects real-world situations, contains some missing values, and is designed to be easily utilized by beginners. (2) Superheros_abilities_dataset can be used to: • Classification model practice: develop classification models that predict moral orientation (Hero/Villain/Antihero) from a character's ability values and attributes. • Clustering and visualization: cluster groups of similar characters based on their abilities and attributes, or use the data for EDA and data-visualization exercises.
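A clustering exercise of the kind described could start like the sketch below; the ability values are synthetic, and only the column semantics (strength, speed, intelligence, combat power) are taken from the description.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic ability scores for 200 characters across 4 numeric attributes
# (strength, speed, intelligence, combat power) - illustrative only.
rng = np.random.default_rng(1)
abilities = rng.integers(1, 101, size=(200, 4)).astype(float)

# Standardize before clustering so no attribute dominates the distance
scaled = StandardScaler().fit_transform(abilities)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(scaled)
print(np.bincount(labels))  # cluster sizes
```

In practice the missing values noted above would need imputation (or row dropping) before this step.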
https://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, The Global EDA Market size will be USD 14.9 billion in 2023 and will grow at a compound annual growth rate (CAGR) of 10.50% from 2023 to 2030.
The demand for the EDA Market is rising due to the rise in outdoor and adventure activities.
Changing consumer lifestyle trends are higher in the EDA market.
The cat segment held the highest EDA Market revenue share in 2023.
North American EDA will continue to lead, whereas the European EDA Market will experience the most substantial growth until 2030.
Supply Chain and Risk Analysis to Provide Viable Market Output
The industry is facing supply chain and logistics disruptions. EDA tools have been instrumental in analyzing supply chain data, identifying vulnerabilities, predicting risks, and developing disruption mitigation strategies. Consumer behavior has undergone drastic changes due to blockages and restrictions. EDA helps companies analyze changing trends in buying behavior, online shopping preferences, and demand patterns, enabling organizations to adjust their marketing and sales strategies accordingly.
Health and Pharmaceutical Research to Propel Market Growth.
EDA tools have played a key role in analyzing large amounts of data related to vaccine development, drug trials, patient records and epidemiological studies. These tools have helped researchers process and interpret complex medical data, leading to advances in the development of treatments and vaccines. The pandemic has created challenges in data collection, especially in sectors affected by lockdowns or blackouts. Rapidly changing conditions and incomplete data sets make effective EDA difficult due to data quality issues. The economic uncertainty caused by the pandemic has led to budget cuts in some sectors, impacting investment in new technologies. Some organizations have limited budgets that limit their ability to adopt or update EDA tools.
Market Dynamics of the EDA
Privacy and Data Security Issues to Restrict Market Growth.
With the focus on data privacy regulations such as GDPR, CCPA, etc., organizations need to ensure compliance when handling sensitive data. These compliance requirements may limit the scope of the EDA by limiting the availability and use of certain data sets for information analysis. EDA often requires data analysts or data scientists who are skilled in statistical analysis and data visualization tools. A lack of professionals with these specialized skills can hinder an organization's ability to use EDA tools effectively, limiting adoption. Advanced EDA techniques can involve complex algorithms and statistical techniques that are difficult for non-technical users to understand. Interpreting results and deriving actionable insights from EDA results pose challenges that affect applicability to a wider audience.
Key Market Opportunity
Growing miniaturization in various industries can be an opportunity.
In the age of highly advanced electronics, miniaturization has become a trend that enables organizations across diverse sectors such as healthcare, consumer electronics, aerospace and defense, and automotive to design miniature electronic devices. These devices incorporate miniaturized semiconductor components, e.g., surgical instruments and blood glucose meters in healthcare, fitness bands in wearables, automotive modules in the automotive sector, and intelligent baggage labels. Miniaturization has a number of advantages, such as freeing space for other features and allowing larger batteries. Growing fitness consciousness among consumers is fueling demand for smaller fitness devices such as smartwatches and fitness trackers, motivating companies to introduce innovative products with improved features, while researchers concentrate on cost-effective and efficient product development through electronic design tools. In addition, portable equipment has gained immense popularity among media professionals because of the increasing demand for live reporting of events such as riots, accidents, sports, and political rallies. Given the inconvenience of using cumbersome TV production vans to access such events, demand for portable handheld equipment has risen: these devices can be carried in backpacks and quickly moved to an event venue. Therefore, the need for compact devices across various industries…
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper introduces GLARE, an Arabic apps reviews dataset collected from the Saudi Google Play Store. It consists of 76M reviews, 69M of which are Arabic reviews of 9,980 Android applications. We present the data collection methodology, along with a detailed Exploratory Data Analysis (EDA) and feature engineering on the gathered reviews. We also highlight possible use cases and benefits of the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Transactional Retail Dataset of Electronics Store’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/muhammadshahrayar/transactional-retail-dataset-of-electronics-store on 14 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains information about an online electronic store. The store has three warehouses from which goods are delivered to customers.
Use this dataset to perform graphical and/or non-graphical EDA methods to understand the data first, then find and fix the data problems:
- Detect and fix errors in dirty_data.csv
- Impute the missing values in missing_data.csv
- Detect and remove anomalies
- Check whether a customer is happy with their last order
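A minimal pandas sketch of the imputation and anomaly-detection steps might look like this; the column name "order_total" is hypothetical, since the real schema is not reproduced here.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with one missing value and one obvious outlier;
# "order_total" is an assumed column name, not from the actual files.
df = pd.DataFrame({"order_total": [20.0, 22.5, np.nan, 21.0, 500.0, 19.5]})

# Impute missing values with the column median
df["order_total"] = df["order_total"].fillna(df["order_total"].median())

# Flag anomalies with a simple IQR rule
q1, q3 = df["order_total"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = ((df["order_total"] < q1 - 1.5 * iqr)
                    | (df["order_total"] > q3 + 1.5 * iqr))
print(df)
```

The same pattern scales to the real files: impute `missing_data.csv` column by column, and use the outlier flags to inspect and remove anomalies rather than deleting rows blindly.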
All the Best
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This is a COVID-19 dataset in Japan. It does not include the cases on the Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) or the Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture).
- Total number of cases in Japan
- The number of vaccinated people (New/experimental)
- The number of cases at prefecture level
- Metadata of each prefecture
Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data
This dataset can be retrieved with CovsirPhy (Python library).
```
pip install covsirphy --upgrade
```

```python
import covsirphy as cs

data_loader = cs.DataLoader()
japan_data = data_loader.japan()

# The number of cases (Total/each province)
clean_df = japan_data.cleaned()

# Metadata
meta_df = japan_data.meta()
```
Please refer to CovsirPhy Documentation: Japan-specific dataset.
Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.
covid_jpn_total.csv
Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020)
- discharged
- fatal
The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020)
- requiring hospitalization, but waiting in hotels or at home (to 08May2020)
In the primary source, some variables were removed on 09May2020; their values are NA in this dataset from 09May2020 onward.
The data was manually collected from the Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)
The number of vaccinated people:
- Vaccinated_1st: the number of persons vaccinated with the first dose on the date
- Vaccinated_2nd: the number of persons vaccinated with the second dose on the date
- Vaccinated_3rd: the number of persons vaccinated with the third dose on the date
Data sources for vaccination:
- To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績 (in Japanese); 首相官邸 新型コロナワクチンについて
- From 10Apr2021: Twitter: 首相官邸(新型コロナワクチン情報)
covid_jpn_prefecture.csv
Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- discharged
- fatal
The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with severe symptoms (from 09May2020)
Using a PDF-to-Excel converter, the data was manually collected from the Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)
Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total. When you analyse total data for Japan, please use the covid_jpn_total data.
covid_jpn_metadata.csv
- Population (Total, Male, Female): 厚生労働省 厚生統計要覧(2017年度)第1-5表
- Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015)
Hospital_bed: With the primary data of 厚生労働省 感染症指定医療機関の指定状況(平成31年4月1日現在), 厚生労働省 第二種感染症指定医療機関の指定状況(平成31年4月1日現在), 厚生労働省 医療施設動態調査(令和2年1月末概数), 厚生労働省 感染症指定医療機関について, and secondary data of COVID-19 Japan 都道府県別 感染症病床数.
Clinic_bed: With the primary data of 医療施設動態調査(令和2年1月末概数).
Location: Data is from LinkData 都道府県庁所在地 (Public Domain) (secondary data).
Admin
To create this dataset, edited and transformed data of the following sites was used.
厚生労働省 Ministry of Health, Labour and Welfare, Japan:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)
厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)
国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan: 国土交通省 HP (in Japanese) 国土交通省 HP (in English) 国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)
Code for Japan / COVID-19 Japan: Code for Japan COVID-19 Japan Dashboard (CC BY 4.0) COVID-19 Japan 都道府県別 感染症病床数 (CC BY)
Wikipedia: Wikipedia
LinkData: LinkData (Public Domain)
Kindly cite this dataset under CC BY-4.0 license as follows. - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or - Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Justcorners is a dataset for object detection tasks - it contains Money annotations for 901 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://www.usa.gov/government-works
The formula-driven calculation projects investment data at 3, 6, and 9 year intervals from the investment award. The formula is based on a study done by Rutgers University, which compiled and analyzed the performance of EDA construction investments after 9 years. This approach was reviewed and validated by third-party analysis conducted by Grant Thornton in 2008. Based on this formula and a review of EDA's historical results, EDA estimates that 40% of the 9-year projection would be realized after 3 years, 75% after 6 years, and 100% after 9 years.
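The phase-in formula above (40% of the 9-year projection realized after 3 years, 75% after 6, 100% after 9) can be expressed directly; the function name is illustrative.

```python
def projected_realization(nine_year_projection):
    """Apply EDA's phase-in shares to a 9-year investment projection."""
    shares = {3: 0.40, 6: 0.75, 9: 1.00}
    return {years: nine_year_projection * share
            for years, share in shares.items()}

# Example: a $1,000,000 nine-year projection
print(projected_realization(1_000_000))
```

So a $1,000,000 nine-year projection implies roughly $400,000 realized at year 3 and $750,000 at year 6.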