Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
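As a hedged illustration of the simplest augmentation family mentioned above, Easy Data Augmentation (EDA), the sketch below implements two of its label-preserving operations (random swap and random deletion) in plain Python. It is a minimal approximation that assumes whitespace tokenization is acceptable; the example sentence is invented, and this is not the authors' implementation.

```python
import random

def random_swap(tokens, n_swaps=1):
    """Swap two random token positions n_swaps times (an EDA operation)."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def augment(sentence, n_aug=4, max_tries=50):
    """Generate up to n_aug noisy variants of a labeled sentence."""
    tokens = sentence.split()
    variants = set()
    tries = 0
    while len(variants) < n_aug and tries < max_tries:
        tries += 1
        op = random.choice([random_swap, random_deletion])
        variants.add(" ".join(op(tokens)))
    return sorted(variants)

if __name__ == "__main__":
    # Invented Spanish example sentence, for illustration only.
    print(augment("la letra de esta canción me parece muy triste"))
```

Each variant keeps the original label, so a small labeled set can be expanded several-fold before training.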
Description: The COVID-19 dataset used for this EDA project encompasses comprehensive data on COVID-19 cases, deaths, and recoveries worldwide. It includes information gathered from authoritative sources such as the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national health agencies. The dataset covers global, regional, and national levels, providing a holistic view of the pandemic's impact.
Purpose: This dataset is instrumental in understanding the multifaceted impact of the COVID-19 pandemic through data exploration. It aligns perfectly with the objectives of the EDA project, aiming to unveil insights, patterns, and trends related to COVID-19. Here are the key objectives:
1. Data Collection and Cleaning:
• Gather reliable COVID-19 datasets from authoritative sources (such as WHO, CDC, or national health agencies).
• Clean and preprocess the data to ensure accuracy and consistency.
2. Descriptive Statistics (a minimal sketch of this step is shown below):
• Summarize key statistics: total cases, recoveries, deaths, and testing rates.
• Visualize temporal trends using line charts, bar plots, and heat maps.
3. Geospatial Analysis:
• Map COVID-19 cases across countries, regions, or cities.
• Identify hotspots and variations in infection rates.
4. Demographic Insights:
• Explore how age, gender, and pre-existing conditions impact vulnerability.
• Investigate disparities in infection rates among different populations.
5. Healthcare System Impact:
• Analyze hospitalization rates, ICU occupancy, and healthcare resource allocation.
• Assess the strain on medical facilities.
6. Economic and Social Effects:
• Investigate the relationship between lockdown measures, economic indicators, and infection rates.
• Explore behavioral changes (e.g., mobility patterns, remote work) during the pandemic.
7. Predictive Modeling (Optional):
• If data permits, build simple predictive models (e.g., time series forecasting) to estimate future cases.
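A minimal sketch of the descriptive-statistics objective, assuming a tidy case-count CSV; the file name and column names (date, country, new_cases, new_deaths) are illustrative placeholders, not the actual schema of the dataset.

```python
# Hedged sketch: objective 2 (descriptive statistics and temporal trends).
# The CSV name and column names below are assumptions for illustration.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("covid19_cases.csv", parse_dates=["date"])

# Summary statistics: cumulative totals and per-country distributions
totals = df[["new_cases", "new_deaths"]].sum()
print("Cumulative totals:\n", totals)
print(df.groupby("country")["new_cases"].describe())

# Temporal trend: global daily cases as a line chart
daily = df.groupby("date")["new_cases"].sum()
daily.plot(title="Global daily COVID-19 cases")
plt.ylabel("New cases")
plt.tight_layout()
plt.show()
```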
Data Sources: The primary sources of the COVID-19 dataset include the Johns Hopkins CSSE COVID-19 Data Repository, Google Health’s COVID-19 Open Data, and the U.S. Economic Development Administration (EDA). These sources provide reliable and up-to-date information on COVID-19 cases, deaths, testing rates, and other relevant variables. Additionally, GitHub repositories and platforms like Medium host supplementary datasets and analyses, enriching the available data resources.
Data Format: The dataset is available in various formats, such as CSV and JSON, facilitating easy access and analysis. Before conducting the EDA, the data underwent preprocessing steps to ensure accuracy and consistency. Data cleaning procedures were performed to address missing values, inconsistencies, and outliers, enhancing the quality and reliability of the dataset.
License: The COVID-19 dataset may be subject to specific usage licenses or restrictions imposed by the original data sources. Proper attribution is essential to acknowledge the contributions of the WHO, CDC, national health agencies, and other entities providing the data. Users should adhere to any licensing terms and usage guidelines associated with the dataset.
Attribution: We acknowledge the invaluable contributions of the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), national health agencies, and other authoritative sources in compiling and disseminating the COVID-19 data used for this EDA project. Their efforts in collecting, curating, and sharing data have been instrumental in advancing our understanding of the pandemic and guiding public health responses globally.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We performed data analysis on an open dataset containing survey responses about how useful students find AI in the educational process. We cleaned and preprocessed the data, carried out an exploratory data analysis (EDA), and visualized the results and our findings. We then interpreted the findings in our digital poster.
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global EDA market size will be USD 14.9 billion in 2023 and will grow at a compound annual growth rate (CAGR) of 10.50% from 2023 to 2030.
Demand in the EDA market is rising due to the growth of outdoor and adventure activities.
Changing consumer lifestyle trends are also prominent in the EDA market.
The cat segment held the highest EDA market revenue share in 2023.
The North American EDA market will continue to lead, whereas the European EDA market will experience the most substantial growth through 2030.
Supply Chain and Risk Analysis to Provide Viable Market Output
The industry is facing supply chain and logistics disruptions. EDA tools have been instrumental in analyzing supply chain data, identifying vulnerabilities, predicting risks, and developing disruption mitigation strategies. Consumer behavior has undergone drastic changes due to lockdowns and restrictions. EDA helps companies analyze changing trends in buying behavior, online shopping preferences, and demand patterns, enabling organizations to adjust their marketing and sales strategies accordingly.
Health and Pharmaceutical Research to Propel Market Growth.
EDA tools have played a key role in analyzing large amounts of data related to vaccine development, drug trials, patient records, and epidemiological studies. These tools have helped researchers process and interpret complex medical data, leading to advances in the development of treatments and vaccines. The pandemic has created challenges in data collection, especially in sectors affected by lockdowns or shutdowns. Rapidly changing conditions and incomplete data sets make effective EDA difficult due to data quality issues. The economic uncertainty caused by the pandemic has led to budget cuts in some sectors, impacting investment in new technologies. Some organizations have constrained budgets that limit their ability to adopt or update EDA tools.
Market Dynamics of the EDA Market
Privacy and Data Security Issues to Restrict Market Growth.
With the focus on data privacy regulations such as GDPR, CCPA, etc., organizations need to ensure compliance when handling sensitive data. These compliance requirements may limit the scope of EDA by restricting the availability and use of certain data sets for analysis. EDA often requires data analysts or data scientists who are skilled in statistical analysis and data visualization tools. A lack of professionals with these specialized skills can hinder an organization's ability to use EDA tools effectively, limiting adoption. Advanced EDA techniques can involve complex algorithms and statistical methods that are difficult for non-technical users to understand. Interpreting results and deriving actionable insights from EDA outputs pose challenges that affect applicability to a wider audience.
Key Market Opportunity
Growing miniaturization in various industries can be an opportunity.
In the age of highly advanced electronics, miniaturization has become a trend that has enabled organizations across diverse sectors such as healthcare, consumer electronics, aerospace and defense, and automotive to design miniature electronic devices. These devices incorporate miniaturized semiconductor components, e.g., surgical instruments and blood glucose meters in healthcare, fitness bands in wearable devices, automotive modules in the automotive sector, and intelligent baggage labels. Miniaturization has a number of advantages, such as freeing space for other features and better batteries. Increased consciousness among consumers towards fitness is fueling the demand for smaller fitness devices such as smartwatches and fitness trackers. This is motivating companies to come up with innovative products with improved features, while researchers are concentrating on cost-effective and efficient product development through electronic design tools. In addition, the use of portable equipment has gained immense popularity among media professionals because of the increasing demand for live reporting of different events like riots, accidents, sports, and political rallies. Given the inconvenience of using cumbersome TV production vans to access such events, demand for portable handheld equipment has risen. Such devices are easily portable and can be quickly moved to the event venue in backpacks. Therefore, the need for compact devices across various indust...
Preventive Maintenance for Marine Engines: Data-Driven Insights
Introduction:
Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.
Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.
Key steps include: 1. Data Simulation: Creating a realistic dataset with engine performance metrics. 2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior. 3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs. 4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
Tools Used 1. Python: Data processing, analysis and modeling 2. Pandas & NumPy: Data manipulation 3. Scikit-Learn & XGBoost: Machine learning model training 4. Matplotlib & Seaborn: Data visualization
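A minimal sketch of steps 3 and 4 (model training and GridSearchCV tuning) using the tools listed above; the file name, feature columns, and target column are illustrative placeholders rather than the actual schema of the simulated dataset.

```python
# Hedged sketch: training a Random Forest and tuning it with GridSearchCV.
# Column names (engine_temp, vibration_level, oil_pressure, maintenance_status)
# and the CSV name are assumptions, not the real simulated-data schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("marine_engine_data.csv")
X = df[["engine_temp", "vibration_level", "oil_pressure"]]
y = df["maintenance_status"]  # Normal / Requires Maintenance / Critical

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="f1_macro")
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

The same pattern extends to Decision Tree and XGBoost models by swapping the estimator and parameter grid.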
Skills Demonstrated ✔ Data Simulation & Preprocessing ✔ Exploratory Data Analysis (EDA) ✔ Feature Engineering & Encoding ✔ Supervised Machine Learning (Classification) ✔ Model Evaluation & Hyperparameter Tuning
Key Insights & Findings 📌 Engine Temperature & Vibration Level: Strong indicators of potential failures. 📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better. 📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training. 📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.
Challenges Faced 🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge. 🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction. 🚧 Feature Selection: Identifying the most impactful features required extensive analysis.
Call to Action 🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters. 📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques. 🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pain is considered subjective since it is based on personal experience; however, it may be possible to analyze pain objectively through a self-reporting system supported by bio-signals. Based on this hypothesis, a multimodal dataset was created, combining EEG and wristband signals (including EDA, BVP, temperature, and accelerometer data) with participants' responses to a survey, including the McGill Pain Questionnaire. This dataset, collected from 99 participants, allows for the analysis of three different types of pain: headache, back pain, and menstrual pain. An Empatica E4 wristband and a MindWave Mobile 2 EEG device were used to collect the data. Both raw and processed device data, together with the survey answers, are stored in the dataset under participants' unique IDs.
https://creativecommons.org/publicdomain/zero/1.0/
Don't ask me where this data comes from; the answer is I don't know!
Credit score cards are a common risk-control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowing, so the bank can decide whether to issue a credit card to the applicant. Credit scores objectively quantify the magnitude of risk.
Generally speaking, credit score cards are based on historical data, so when large economic fluctuations occur, past models may lose their original predictive power. Logistic regression is a common method for credit scoring because it is suited to binary classification tasks and yields a coefficient for each feature. To make the scores easy to understand and operate, the score card multiplies each logistic regression coefficient by a certain value (such as 100) and rounds it.
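A minimal sketch of the scaling-and-rounding step described above, on toy data; the feature names and the scaling factor of 100 are illustrative, not a prescribed scorecard calibration.

```python
# Hedged sketch: turning logistic-regression coefficients into scorecard
# points by scaling and rounding. Features, labels, and the factor of 100
# are toy assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))                                # toy features
y = (X @ np.array([0.8, -0.5, 0.3]) + 0.2 > 0).astype(int)   # toy labels

model = LogisticRegression().fit(X, y)

scale = 100
points = np.round(model.coef_[0] * scale).astype(int)
for name, p in zip(["income", "age", "num_children"], points):
    print(f"{name}: {p} points per unit")
```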
At present, with the development of machine learning algorithms, more predictive methods such as boosting, random forests, and support vector machines have been introduced into credit card scoring. However, these methods often lack transparency, making it difficult to provide customers and regulators with a reason for rejection or acceptance.
Task: build a machine learning model to predict whether an applicant is a 'good' or 'bad' client. Unlike other tasks, the definition of 'good' or 'bad' is not given, so you should use a technique such as vintage analysis to construct your own label. Class imbalance is also a major challenge in this task.
There are two tables, which can be merged by ID:
application_record.csv

| Feature name | Explanation | Remarks |
|---|---|---|
| ID | Client number | |
| CODE_GENDER | Gender | |
| FLAG_OWN_CAR | Is there a car | |
| FLAG_OWN_REALTY | Is there a property | |
| CNT_CHILDREN | Number of children | |
| AMT_INCOME_TOTAL | Annual income | |
| NAME_INCOME_TYPE | Income category | |
| NAME_EDUCATION_TYPE | Education level | |
| NAME_FAMILY_STATUS | Marit... | |
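A minimal sketch of merging the two tables by ID and deriving a rough good/bad label; the second table's name (credit_record.csv) and its STATUS coding are assumptions, and a production label should come from vintage analysis as suggested above.

```python
# Hedged sketch: join the application table to the credit-history table by ID
# and build a simplified label. File name and STATUS codes are assumptions.
import pandas as pd

apps = pd.read_csv("application_record.csv")
credit = pd.read_csv("credit_record.csv")  # assumed second table

# Simplified rule: a client is "bad" (1) if any monthly STATUS indicates
# 60+ days past due (codes '2'..'5' in the commonly seen encoding).
bad_ids = credit.loc[credit["STATUS"].isin(list("2345")), "ID"].unique()
apps["label"] = apps["ID"].isin(bad_ids).astype(int)

print(apps["label"].value_counts(normalize=True))  # expect heavy imbalance
# The imbalance can then be handled with, e.g., class weights
# (class_weight="balanced" in scikit-learn) or resampling.
```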
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf
The Copernicus European Regional ReAnalysis (CERRA) datasets provide spatially and temporally consistent historical reconstructions of meteorological variables in the atmosphere and at the surface. There are four subsets: single levels (atmospheric and surface quantities), height levels (upper-air fields up to 500m), pressure levels (upper-air fields up to 1hPa) and model levels (native levels of the model). This entry provides reanalysis and forecast data on single levels for Europe from 1984 to present. Several atmospheric parameters are common to both reanalysis and forecast (e.g. temperature, wind), whilst others are produced only by the forecast model (e.g. 10m wind gust, radiative fluxes).

Reanalysis combines model data with observations into a complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved, reprocessed, versions of the original observations, which all benefit the quality of the reanalysis product.

The CERRA dataset was produced using the HARMONIE-ALADIN limited-area numerical weather prediction and data assimilation system, hereafter referred to as the CERRA system. The CERRA system employs a 3-dimensional variational data assimilation scheme of the atmospheric state at every assimilation time. The reanalysis dataset is convenient owing to its provision of atmospheric estimates at each model domain grid point over Europe for each regular output time, over a long period, and always using the same data format.

The inputs to CERRA reanalysis are the observational data, lateral boundary conditions from ERA5 global reanalysis as prior estimates of the atmospheric state and physiographic datasets describing the surface characteristics of the model. The observing system has evolved over time, and although the data assimilation system can resolve data holes, the much sparser observational networks in the past periods (for example a reduced amount of satellite data in the 1980s) can impact the quality of analyses leading to less accurate estimates.

The uncertainty estimates for reanalysis variables are provided by the CERRA-EDA, a 10-member ensemble of data assimilation system. The added value of the CERRA data with respect to the global reanalysis products is expected to come, for example, with the higher horizontal resolution that permits the usage of a better description of the model topography and physiographic data, and the assimilation of more surface observations. More information about the CERRA dataset can be found in the Documentation section.
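A minimal sketch of retrieving CERRA single-level fields with the cdsapi client, assuming a registered CDS account and configured ~/.cdsapirc credentials; the dataset identifier and request keys mirror the CDS download form and should be verified there rather than taken as definitive.

```python
# Hedged sketch: a CDS API request for one CERRA single-level field.
# The dataset ID and request keys below are assumptions based on the CDS
# web form; check the form's "Show API request" output before use.
import cdsapi

client = cdsapi.Client()
client.retrieve(
    "reanalysis-cerra-single-levels",
    {
        "variable": "2m_temperature",
        "level_type": "surface_or_atmosphere",
        "data_type": "reanalysis",
        "product_type": "analysis",
        "year": "2010",
        "month": "07",
        "day": "15",
        "time": "12:00",
        "format": "grib",
    },
    "cerra_t2m_20100715.grib",
)
```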
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.
Inspired by McKinsey’s real-world assessment framework, this dataset was designed to enable:
- Exploratory Data Analysis (EDA)
- Recruitment trend analysis
- Gamified performance modelling
- Dashboard development in Excel / Power BI
- Resume and education impact evaluation
- Regional performance benchmarking
- Data storytelling for portfolio projects
Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.
This dataset includes 4,000 rows and the following columns:
- Testtaker ID: Unique identifier
- Country / Region: Geographic segmentation
- Gender / Age: Demographics
- Year: Assessment year (2018–2025)
- Highest Level of Education: From high school to PhD / MBA
- School or University Attended: Mapped to country and education level
- First-generation University Student: Yes/No
- Employment Status: Student, Employed, Unemployed
- Role Applied For and Department / Interest: Business/tech disciplines
- Past Test Taker: Indicates repeat attempts
- Prepared with Online Materials: Indicates test prep involvement
- Desired Office Location: Mapped to McKinsey's international offices
- Ecosystem / Redrock / Seawolf (%): Game performance scores
- Time Spent on Each Game (mins)
- Total Product Score: Average of the 3 game scores
- Process Score: A secondary assessment component
- Resume Score: Scored based on education prestige, role fit, and clarity
- Total Assessment Score (%): Final decision metric
- Status (Pass/Fail): Based on total score ≥ 75%
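A minimal sketch of recomputing the two derived columns described above (Total Product Score as the mean of the three game scores, and Pass/Fail at the 75% threshold); the file name and column names are simplified assumptions, not the exact CSV headers.

```python
# Hedged sketch: rebuild the derived columns from the raw game scores.
# File name and column names are illustrative and may differ in the CSV.
import pandas as pd

df = pd.read_csv("mckinsey_solve_simulated.csv")  # hypothetical filename

game_cols = ["ecosystem_pct", "redrock_pct", "seawolf_pct"]
df["total_product_score"] = df[game_cols].mean(axis=1)

# The description states the pass decision is based on a total score >= 75%.
df["status"] = (df["total_assessment_score"] >= 75).map(
    {True: "Pass", False: "Fail"}
)
print(df["status"].value_counts())
```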
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset provided in this study serves as a helpful resource for comprehending emotional reactions within the domain of neuromarketing. Data was acquired from a sample of 58 participants, ranging in age from 18 to 70, through the implementation of a carefully planned experimental design. The participants were shown a collection of 35 branding advertisements classified into three categories: cosmetics and fashion, car and technology, and food and market. The Empatica e4 wearable sensor device was utilized to record several physiological signals, such as photoplethysmography (PPG), electrodermal activity (EDA), and body temperature. Concurrently, the process of capturing facial expressions was conducted using high-definition cameras during periods dedicated to viewing advertisements. The emotional assessments of the participants were evaluated utilizing an emotion appraisal scale, while demographic information was gathered via questionnaires. The utilization of this multifaceted dataset allows scholars to explore the complex realm of consumer decision-making processes, taking into account variables such as age, gender, and varied cultural backgrounds. Through the integration of physiological signals and facial expressions, the dataset offers valuable insights into the underlying neurological mechanisms that drive emotional responses toward branded commercials. This can be utilized by researchers to study emotional patterns, investigate correlations between consumers and advertising, and develop customized neuromarketing methods that are beneficial for individual preferences.
Published: 20 December 2023 | Version 2 | DOI: 10.17632/7md7yty9sk.2
This dataset provides valuable insights into emotional responses within the context of neuromarketing, based on data collected from 58 participants aged 18-70. Participants viewed 35 branding advertisements across three categories: cosmetics and fashion, car and technology, and food and market.
The data was collected using the Empatica e4 wearable sensor to measure physiological signals such as photoplethysmography (PPG), electrodermal activity (EDA), and body temperature.
Simultaneously, high-definition cameras captured facial expressions during the advertisement viewing periods. The emotional reactions of the participants were assessed using an emotion appraisal scale, and demographic information was gathered through questionnaires.
This dataset is particularly useful for researchers looking to understand emotional patterns, identify emotional triggers in advertising, and explore how consumer preferences can be influenced by physiological and facial expression data.
NeuroBioSense Dataset: Kocaçınar, B., İnan, P., Zamur, E. N., Çalşimşek, B., Akbulut, F. P., & Catal, C. (2024). NeuroBioSense: A multidimensional dataset for neuromarketing analysis. Data in Brief, 53, 110235.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The EDA result of Climate Change is Concern.
This repository contains data on EDA measurements of visitors with different cultural backgrounds in virtual urban park settings. The parks are a Persian garden (Shiraz, Iran) and a historical park in Zurich, Switzerland. The cultural background of the visitors is Persian and Central European. The repository contains raw data from EDA, processed time series and statistical procedures.
Context
Thames Water is one of the UK's largest water providers, serving over 15 million customers in South East England. Like other UK water providers, in recent years they have come under increasing fire over their sewage overflow pollution, occurring as a result of ageing wastewater infrastructure and increasing customer demand. As part of a billion-pound wastewater improvement and transparency scheme, Thames Water also became the first UK water provider to release real-time data on sewage overflows, available here.
Motivation
As an avid outdoors person, wild swimmer, and supporter of the Surfers Against Sewage charity, I believe it is extremely important for information on this issue to be publicly available. In the critical mission of reducing sewage pollution, transparency and access to publicly available data are paramount for water companies like Thames Water. By openly sharing information about their wastewater management practices, discharge levels, and environmental impact, these companies foster accountability and public awareness within the communities they serve and enable collaborative efforts to address sewage pollution effectively.
How to use
I hope you all find this data as interesting to process and explore as I have! I will upload an EDA and modelling notebook soon.
I strongly recommend looking into the Surfers against Sewage charity if you are interested in supporting the cause to reduce sewage pollution and increase awareness!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The objective of the max-cut problem is to partition the vertices of a graph into two subsets such that the total weight of the edges crossing between the two subsets is maximized. Although it is an elementary graph partitioning problem, it is one of the most challenging combinatorial optimization problems, and its many application areas make it highly relevant. For this reason, the problem is solved here using the Harris Hawks Optimization algorithm (HHO). Though HHO has effectively solved some engineering optimization problems, it is sensitive to parameter settings and may converge slowly, potentially getting trapped in local optima. Thus, HHO together with some additional operators is used to solve the max-cut problem. Crossover and refinement operators are used to modify the fitness of the hawks in such a way that they can provide precise results. A mutation mechanism along with an adjustment operator improves the outcome obtained from the updated hawks. To accept a potential result, an acceptance criterion is used, and then a repair operator is applied in the proposed approach. The proposed system provided comparatively better outcomes on the G-set dataset than other state-of-the-art algorithms: it obtained 533 more cuts than the discrete cuckoo search algorithm on 9 instances, 1036 more cuts than PSO-EDA on 14 instances, and 1021 more cuts than TSHEA on 9 instances. For four instances, however, the cuts are lower than PSO-EDA and TSHEA. Statistical significance has also been tested using the Wilcoxon signed-rank test to support the superior performance of the proposed method. In terms of solution quality, MC-HHO can produce outcomes that are quite competitive when compared to other related state-of-the-art algorithms.
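A minimal sketch of the max-cut objective itself, i.e. the fitness a solver such as MC-HHO maximizes, evaluated for one candidate partition of a toy graph; it is not an implementation of the HHO-based approach described above.

```python
# Hedged sketch: the max-cut objective (total weight of edges whose
# endpoints fall in different subsets), evaluated for a candidate partition.
def cut_weight(edges, partition):
    """edges: list of (u, v, weight); partition: dict vertex -> 0 or 1."""
    return sum(w for u, v, w in edges if partition[u] != partition[v])

# Toy weighted graph and one candidate partition of its vertices.
edges = [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 3.0), (0, 3, 1.5), (1, 3, 2.5)]
partition = {0: 0, 1: 1, 2: 0, 3: 1}
print(cut_weight(edges, partition))  # 2.0 + 1.0 + 3.0 + 1.5 = 7.5
```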
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Flipkart India Phone Listings as of May 2024
Data scraped using Data Miner for the following details:
- Device Name
- Star Rating
- Storage
- Display
- Price
Data is intentionally left uncleaned to be used for cleaning and EDA processes.
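A minimal sketch of typical cleaning steps for these scraped fields; the file name and column headers are assumptions based on the field list above and may differ from the actual export.

```python
# Hedged sketch: cleaning scraped listing fields before EDA.
# File name and column headers ("Device Name", "Price", "Star Rating")
# are assumptions mirroring the field list above.
import pandas as pd

df = pd.read_csv("flipkart_phones_may2024.csv")  # hypothetical filename

# Strip currency symbols / thousands separators, e.g. "₹12,999" -> 12999.0
df["Price"] = pd.to_numeric(
    df["Price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
df["Star Rating"] = pd.to_numeric(df["Star Rating"], errors="coerce")

df = df.drop_duplicates(subset="Device Name").dropna(subset=["Price"])
print(df.describe(include="all"))
```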