Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
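As a hedged illustration of the simplest augmentation family mentioned above, Easy Data Augmentation (EDA), the sketch below implements two of its label-preserving operations (random swap and random deletion) in plain Python. It is a minimal approximation that assumes whitespace tokenization is acceptable; the example sentence is invented, and this is not the authors' implementation.

```python
import random

def random_swap(tokens, n_swaps=1):
    """Swap two random token positions n_swaps times (an EDA operation)."""
    tokens = tokens[:]
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def augment(sentence, n_aug=4, max_tries=50):
    """Generate up to n_aug noisy variants of a labeled sentence."""
    tokens = sentence.split()
    variants = set()
    tries = 0
    while len(variants) < n_aug and tries < max_tries:
        tries += 1
        op = random.choice([random_swap, random_deletion])
        variants.add(" ".join(op(tokens)))
    return sorted(variants)

if __name__ == "__main__":
    # Invented Spanish example sentence, for illustration only.
    print(augment("la letra de esta canción me parece muy triste"))
```

Each variant keeps the original label, so a small labeled set can be expanded several-fold before training.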
Description: The COVID-19 dataset used for this EDA project encompasses comprehensive data on COVID-19 cases, deaths, and recoveries worldwide. It includes information gathered from authoritative sources such as the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national health agencies. The dataset covers global, regional, and national levels, providing a holistic view of the pandemic's impact.
Purpose: This dataset is instrumental in understanding the multifaceted impact of the COVID-19 pandemic through data exploration. It aligns perfectly with the objectives of the EDA project, aiming to unveil insights, patterns, and trends related to COVID-19. Here are the key objectives:
1. Data Collection and Cleaning:
• Gather reliable COVID-19 datasets from authoritative sources (such as WHO, CDC, or national health agencies).
• Clean and preprocess the data to ensure accuracy and consistency.
2. Descriptive Statistics (a minimal sketch of this step is shown below):
• Summarize key statistics: total cases, recoveries, deaths, and testing rates.
• Visualize temporal trends using line charts, bar plots, and heat maps.
3. Geospatial Analysis:
• Map COVID-19 cases across countries, regions, or cities.
• Identify hotspots and variations in infection rates.
4. Demographic Insights:
• Explore how age, gender, and pre-existing conditions impact vulnerability.
• Investigate disparities in infection rates among different populations.
5. Healthcare System Impact:
• Analyze hospitalization rates, ICU occupancy, and healthcare resource allocation.
• Assess the strain on medical facilities.
6. Economic and Social Effects:
• Investigate the relationship between lockdown measures, economic indicators, and infection rates.
• Explore behavioral changes (e.g., mobility patterns, remote work) during the pandemic.
7. Predictive Modeling (Optional):
• If data permits, build simple predictive models (e.g., time series forecasting) to estimate future cases.
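A minimal sketch of the descriptive-statistics objective, assuming a tidy case-count CSV; the file name and column names (date, country, new_cases, new_deaths) are illustrative placeholders, not the actual schema of the dataset.

```python
# Hedged sketch: objective 2 (descriptive statistics and temporal trends).
# The CSV name and column names below are assumptions for illustration.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("covid19_cases.csv", parse_dates=["date"])

# Summary statistics: cumulative totals and per-country distributions
totals = df[["new_cases", "new_deaths"]].sum()
print("Cumulative totals:\n", totals)
print(df.groupby("country")["new_cases"].describe())

# Temporal trend: global daily cases as a line chart
daily = df.groupby("date")["new_cases"].sum()
daily.plot(title="Global daily COVID-19 cases")
plt.ylabel("New cases")
plt.tight_layout()
plt.show()
```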
Data Sources: The primary sources of the COVID-19 dataset include the Johns Hopkins CSSE COVID-19 Data Repository, Google Health’s COVID-19 Open Data, and the U.S. Economic Development Administration (EDA). These sources provide reliable and up-to-date information on COVID-19 cases, deaths, testing rates, and other relevant variables. Additionally, GitHub repositories and platforms like Medium host supplementary datasets and analyses, enriching the available data resources.
Data Format: The dataset is available in various formats, such as CSV and JSON, facilitating easy access and analysis. Before conducting the EDA, the data underwent preprocessing steps to ensure accuracy and consistency. Data cleaning procedures were performed to address missing values, inconsistencies, and outliers, enhancing the quality and reliability of the dataset.
License: The COVID-19 dataset may be subject to specific usage licenses or restrictions imposed by the original data sources. Proper attribution is essential to acknowledge the contributions of the WHO, CDC, national health agencies, and other entities providing the data. Users should adhere to any licensing terms and usage guidelines associated with the dataset.
Attribution: We acknowledge the invaluable contributions of the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), national health agencies, and other authoritative sources in compiling and disseminating the COVID-19 data used for this EDA project. Their efforts in collecting, curating, and sharing data have been instrumental in advancing our understanding of the pandemic and guiding public health responses globally.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We performed data analysis on an open dataset containing survey responses about how useful students find AI in the educational process. We cleaned and preprocessed the data, carried out an exploratory data analysis (EDA), and visualized the results and our findings. We then interpreted the findings in our digital poster.
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global EDA market size will be USD 14.9 billion in 2023 and will grow at a compound annual growth rate (CAGR) of 10.50% from 2023 to 2030.
Demand in the EDA market is rising due to the growth of outdoor and adventure activities.
Changing consumer lifestyle trends are also prominent in the EDA market.
The cat segment held the highest EDA market revenue share in 2023.
The North American EDA market will continue to lead, whereas the European EDA market will experience the most substantial growth through 2030.
Supply Chain and Risk Analysis to Provide Viable Market Output
The industry is facing supply chain and logistics disruptions. EDA tools have been instrumental in analyzing supply chain data, identifying vulnerabilities, predicting risks, and developing disruption mitigation strategies. Consumer behavior has undergone drastic changes due to lockdowns and restrictions. EDA helps companies analyze changing trends in buying behavior, online shopping preferences, and demand patterns, enabling organizations to adjust their marketing and sales strategies accordingly.
Health and Pharmaceutical Research to Propel Market Growth.
EDA tools have played a key role in analyzing large amounts of data related to vaccine development, drug trials, patient records, and epidemiological studies. These tools have helped researchers process and interpret complex medical data, leading to advances in the development of treatments and vaccines. The pandemic has created challenges in data collection, especially in sectors affected by lockdowns or shutdowns. Rapidly changing conditions and incomplete data sets make effective EDA difficult due to data quality issues. The economic uncertainty caused by the pandemic has led to budget cuts in some sectors, impacting investment in new technologies. Some organizations have constrained budgets that limit their ability to adopt or update EDA tools.
Market Dynamics of the EDA Market
Privacy and Data Security Issues to Restrict Market Growth.
With the focus on data privacy regulations such as GDPR, CCPA, etc., organizations need to ensure compliance when handling sensitive data. These compliance requirements may limit the scope of EDA by restricting the availability and use of certain data sets for analysis. EDA often requires data analysts or data scientists who are skilled in statistical analysis and data visualization tools. A lack of professionals with these specialized skills can hinder an organization's ability to use EDA tools effectively, limiting adoption. Advanced EDA techniques can involve complex algorithms and statistical methods that are difficult for non-technical users to understand. Interpreting results and deriving actionable insights from EDA outputs pose challenges that affect applicability to a wider audience.
Key Market Opportunity
Growing miniaturization in various industries can be an opportunity.
In the age of highly advanced electronics, miniaturization has become a trend that has enabled organizations across diverse sectors such as healthcare, consumer electronics, aerospace and defense, and automotive to design miniature electronic devices. These devices incorporate miniaturized semiconductor components, e.g., surgical instruments and blood glucose meters in healthcare, fitness bands in wearable devices, automotive modules in the automotive sector, and intelligent baggage labels. Miniaturization has a number of advantages, such as freeing space for other features and better batteries. Increased consciousness among consumers towards fitness is fueling the demand for smaller fitness devices such as smartwatches and fitness trackers. This is motivating companies to come up with innovative products with improved features, while researchers are concentrating on cost-effective and efficient product development through electronic design tools. In addition, the use of portable equipment has gained immense popularity among media professionals because of the increasing demand for live reporting of different events like riots, accidents, sports, and political rallies. Given the inconvenience of using cumbersome TV production vans to access such events, demand for portable handheld equipment has risen. Such devices are easily portable and can be quickly moved to the event venue in backpacks. Therefore, the need for compact devices across various indust...
Preventive Maintenance for Marine Engines: Data-Driven Insights
Introduction:
Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.
Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.
Key steps include: 1. Data Simulation: Creating a realistic dataset with engine performance metrics. 2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior. 3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs. 4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
Tools Used 1. Python: Data processing, analysis and modeling 2. Pandas & NumPy: Data manipulation 3. Scikit-Learn & XGBoost: Machine learning model training 4. Matplotlib & Seaborn: Data visualization
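A minimal sketch of steps 3 and 4 (model training and GridSearchCV tuning) using the tools listed above; the file name, feature columns, and target column are illustrative placeholders rather than the actual schema of the simulated dataset.

```python
# Hedged sketch: training a Random Forest and tuning it with GridSearchCV.
# Column names (engine_temp, vibration_level, oil_pressure, maintenance_status)
# and the CSV name are assumptions, not the real simulated-data schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("marine_engine_data.csv")
X = df[["engine_temp", "vibration_level", "oil_pressure"]]
y = df["maintenance_status"]  # Normal / Requires Maintenance / Critical

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="f1_macro")
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

The same pattern extends to Decision Tree and XGBoost models by swapping the estimator and parameter grid.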
Skills Demonstrated ✔ Data Simulation & Preprocessing ✔ Exploratory Data Analysis (EDA) ✔ Feature Engineering & Encoding ✔ Supervised Machine Learning (Classification) ✔ Model Evaluation & Hyperparameter Tuning
Key Insights & Findings 📌 Engine Temperature & Vibration Level: Strong indicators of potential failures. 📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better. 📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training. 📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.
Challenges Faced 🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge. 🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction. 🚧 Feature Selection: Identifying the most impactful features required extensive analysis.
Call to Action 🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters. 📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques. 🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pain is considered subjective since it is based on personal experience; however, it may be possible to analyze pain objectively through a self-reporting system supported by bio-signals. Based on this hypothesis, a multimodal dataset was created, combining EEG and wristband signals (including EDA, BVP, temperature, and accelerometer data) with participants' responses to a survey, including the McGill Pain Questionnaire. This dataset, collected from 99 participants, allows for the analysis of three different types of pain: headache, back pain, and menstrual pain. An Empatica E4 wristband and a MindWave Mobile 2 EEG device were used to collect the data. Both raw and processed device data, together with the survey answers, are stored in the dataset under participants' unique IDs.
https://creativecommons.org/publicdomain/zero/1.0/
Don't ask me where this data comes from; the answer is I don't know!
Credit score cards are a common risk-control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowing, so the bank can decide whether to issue a credit card to the applicant. Credit scores objectively quantify the magnitude of risk.
Generally speaking, credit score cards are based on historical data, so when large economic fluctuations occur, past models may lose their original predictive power. Logistic regression is a common method for credit scoring because it is suited to binary classification tasks and yields a coefficient for each feature. To make the scores easy to understand and operate, the score card multiplies each logistic regression coefficient by a certain value (such as 100) and rounds it.
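A minimal sketch of the scaling-and-rounding step described above, on toy data; the feature names and the scaling factor of 100 are illustrative, not a prescribed scorecard calibration.

```python
# Hedged sketch: turning logistic-regression coefficients into scorecard
# points by scaling and rounding. Features, labels, and the factor of 100
# are toy assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))                                # toy features
y = (X @ np.array([0.8, -0.5, 0.3]) + 0.2 > 0).astype(int)   # toy labels

model = LogisticRegression().fit(X, y)

scale = 100
points = np.round(model.coef_[0] * scale).astype(int)
for name, p in zip(["income", "age", "num_children"], points):
    print(f"{name}: {p} points per unit")
```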
At present, with the development of machine learning algorithms, more predictive methods such as boosting, random forests, and support vector machines have been introduced into credit card scoring. However, these methods often lack transparency, making it difficult to provide customers and regulators with a reason for rejection or acceptance.
Task: build a machine learning model to predict whether an applicant is a 'good' or 'bad' client. Unlike other tasks, the definition of 'good' or 'bad' is not given, so you should use a technique such as vintage analysis to construct your own label. Class imbalance is also a major challenge in this task.
There are two tables, which can be merged by ID:
application_record.csv

| Feature name | Explanation | Remarks |
|---|---|---|
| ID | Client number | |
| CODE_GENDER | Gender | |
| FLAG_OWN_CAR | Is there a car | |
| FLAG_OWN_REALTY | Is there a property | |
| CNT_CHILDREN | Number of children | |
| AMT_INCOME_TOTAL | Annual income | |
| NAME_INCOME_TYPE | Income category | |
| NAME_EDUCATION_TYPE | Education level | |
| NAME_FAMILY_STATUS | Marit... | |
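A minimal sketch of merging the two tables by ID and deriving a rough good/bad label; the second table's name (credit_record.csv) and its STATUS coding are assumptions, and a production label should come from vintage analysis as suggested above.

```python
# Hedged sketch: join the application table to the credit-history table by ID
# and build a simplified label. File name and STATUS codes are assumptions.
import pandas as pd

apps = pd.read_csv("application_record.csv")
credit = pd.read_csv("credit_record.csv")  # assumed second table

# Simplified rule: a client is "bad" (1) if any monthly STATUS indicates
# 60+ days past due (codes '2'..'5' in the commonly seen encoding).
bad_ids = credit.loc[credit["STATUS"].isin(list("2345")), "ID"].unique()
apps["label"] = apps["ID"].isin(bad_ids).astype(int)

print(apps["label"].value_counts(normalize=True))  # expect heavy imbalance
# The imbalance can then be handled with, e.g., class weights
# (class_weight="balanced" in scikit-learn) or resampling.
```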
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf
The Copernicus European Regional ReAnalysis (CERRA) datasets provide spatially and temporally consistent historical reconstructions of meteorological variables in the atmosphere and at the surface. There are four subsets: single levels (atmospheric and surface quantities), height levels (upper-air fields up to 500m), pressure levels (upper-air fields up to 1hPa) and model levels (native levels of the model). This entry provides reanalysis and forecast data on single levels for Europe from 1984 to present. Several atmospheric parameters are common to both reanalysis and forecast (e.g. temperature, wind), whilst others are produced only by the forecast model (e.g. 10m wind gust, radiative fluxes).

Reanalysis combines model data with observations into a complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved, reprocessed, versions of the original observations, which all benefit the quality of the reanalysis product.

The CERRA dataset was produced using the HARMONIE-ALADIN limited-area numerical weather prediction and data assimilation system, hereafter referred to as the CERRA system. The CERRA system employs a 3-dimensional variational data assimilation scheme of the atmospheric state at every assimilation time. The reanalysis dataset is convenient owing to its provision of atmospheric estimates at each model domain grid point over Europe for each regular output time, over a long period, and always using the same data format.

The inputs to CERRA reanalysis are the observational data, lateral boundary conditions from ERA5 global reanalysis as prior estimates of the atmospheric state and physiographic datasets describing the surface characteristics of the model. The observing system has evolved over time, and although the data assimilation system can resolve data holes, the much sparser observational networks in the past periods (for example a reduced amount of satellite data in the 1980s) can impact the quality of analyses leading to less accurate estimates.

The uncertainty estimates for reanalysis variables are provided by the CERRA-EDA, a 10-member ensemble of data assimilation system. The added value of the CERRA data with respect to the global reanalysis products is expected to come, for example, with the higher horizontal resolution that permits the usage of a better description of the model topography and physiographic data, and the assimilation of more surface observations. More information about the CERRA dataset can be found in the Documentation section.
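A minimal sketch of retrieving CERRA single-level fields with the cdsapi client, assuming a registered CDS account and configured ~/.cdsapirc credentials; the dataset identifier and request keys mirror the CDS download form and should be verified there rather than taken as definitive.

```python
# Hedged sketch: a CDS API request for one CERRA single-level field.
# The dataset ID and request keys below are assumptions based on the CDS
# web form; check the form's "Show API request" output before use.
import cdsapi

client = cdsapi.Client()
client.retrieve(
    "reanalysis-cerra-single-levels",
    {
        "variable": "2m_temperature",
        "level_type": "surface_or_atmosphere",
        "data_type": "reanalysis",
        "product_type": "analysis",
        "year": "2010",
        "month": "07",
        "day": "15",
        "time": "12:00",
        "format": "grib",
    },
    "cerra_t2m_20100715.grib",
)
```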
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
McKinsey's Solve is a gamified problem-solving assessment used globally in the consulting firm’s recruitment process. This dataset simulates assessment results across geographies, education levels, and roles over a 7-year period. It aims to provide deep insights into performance trends, candidate readiness, resume quality, and cognitive task outcomes.
Inspired by McKinsey’s real-world assessment framework, this dataset was designed to enable:
- Exploratory Data Analysis (EDA)
- Recruitment trend analysis
- Gamified performance modelling
- Dashboard development in Excel / Power BI
- Resume and education impact evaluation
- Regional performance benchmarking
- Data storytelling for portfolio projects
Whether you're building dashboards or training models, this dataset offers practical and relatable data for HR analytics and consulting use cases.
This dataset includes 4,000 rows and the following columns:
- Testtaker ID: Unique identifier
- Country / Region: Geographic segmentation
- Gender / Age: Demographics
- Year: Assessment year (2018–2025)
- Highest Level of Education: From high school to PhD / MBA
- School or University Attended: Mapped to country and education level
- First-generation University Student: Yes/No
- Employment Status: Student, Employed, Unemployed
- Role Applied For and Department / Interest: Business/tech disciplines
- Past Test Taker: Indicates repeat attempts
- Prepared with Online Materials: Indicates test prep involvement
- Desired Office Location: Mapped to McKinsey's international offices
- Ecosystem / Redrock / Seawolf (%): Game performance scores
- Time Spent on Each Game (mins)
- Total Product Score: Average of the 3 game scores
- Process Score: A secondary assessment component
- Resume Score: Scored based on education prestige, role fit, and clarity
- Total Assessment Score (%): Final decision metric
- Status (Pass/Fail): Based on total score ≥ 75%
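A minimal sketch of recomputing the two derived columns described above (Total Product Score as the mean of the three game scores, and Pass/Fail at the 75% threshold); the file name and column names are simplified assumptions, not the exact CSV headers.

```python
# Hedged sketch: rebuild the derived columns from the raw game scores.
# File name and column names are illustrative and may differ in the CSV.
import pandas as pd

df = pd.read_csv("mckinsey_solve_simulated.csv")  # hypothetical filename

game_cols = ["ecosystem_pct", "redrock_pct", "seawolf_pct"]
df["total_product_score"] = df[game_cols].mean(axis=1)

# The description states the pass decision is based on a total score >= 75%.
df["status"] = (df["total_assessment_score"] >= 75).map(
    {True: "Pass", False: "Fail"}
)
print(df["status"].value_counts())
```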
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset provided in this study serves as a helpful resource for comprehending emotional reactions within the domain of neuromarketing. Data was acquired from a sample of 58 participants, ranging in age from 18 to 70, through the implementation of a carefully planned experimental design. The participants were shown a collection of 35 branding advertisements classified into three categories: cosmetics and fashion, car and technology, and food and market. The Empatica e4 wearable sensor device was utilized to record several physiological signals, such as photoplethysmography (PPG), electrodermal activity (EDA), and body temperature. Concurrently, the process of capturing facial expressions was conducted using high-definition cameras during periods dedicated to viewing advertisements. The emotional assessments of the participants were evaluated utilizing an emotion appraisal scale, while demographic information was gathered via questionnaires. The utilization of this multifaceted dataset allows scholars to explore the complex realm of consumer decision-making processes, taking into account variables such as age, gender, and varied cultural backgrounds. Through the integration of physiological signals and facial expressions, the dataset offers valuable insights into the underlying neurological mechanisms that drive emotional responses toward branded commercials. This can be utilized by researchers to study emotional patterns, investigate correlations between consumers and advertising, and develop customized neuromarketing methods that are beneficial for individual preferences.
Published: 20 December 2023 | Version 2 | DOI: 10.17632/7md7yty9sk.2
This dataset provides valuable insights into emotional responses within the context of neuromarketing, based on data collected from 58 participants aged 18-70. Participants viewed 35 branding advertisements across three categories: cosmetics and fashion, car and technology, and food and market.
The data was collected using the Empatica e4 wearable sensor to measure physiological signals such as photoplethysmography (PPG), electrodermal activity (EDA), and body temperature.
Simultaneously, high-definition cameras captured facial expressions during the advertisement viewing periods. The emotional reactions of the participants were assessed using an emotion appraisal scale, and demographic information was gathered through questionnaires.
This dataset is particularly useful for researchers looking to understand emotional patterns, identify emotional triggers in advertising, and explore how consumer preferences can be influenced by physiological and facial expression data.
NeuroBioSense Dataset: Kocaçınar, B., İnan, P., Zamur, E. N., Çalşimşek, B., Akbulut, F. P., & Catal, C. (2024). NeuroBioSense: A multidimensional dataset for neuromarketing analysis. Data in Brief, 53, 110235.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The EDA result of Climate Change is Concern.
This repository contains data on EDA measurements of visitors with different cultural backgrounds in virtual urban park settings. The parks are a Persian garden (Shiraz, Iran) and a historical park in Zurich, Switzerland. The cultural background of the visitors is Persian and Central European. The repository contains raw data from EDA, processed time series and statistical procedures.
Context
Thames Water is one of the UK's largest water providers, serving over 15 million customers in South East England. Like other UK water providers, in recent years they have come under increasing fire over their sewage overflow pollution, occurring as a result of ageing wastewater infrastructure and increasing customer demand. As part of a billion-pound wastewater improvement and transparency scheme, Thames Water also became the first UK water provider to release real-time data on sewage overflows, available here.
Motivation
As an avid outdoors person, wild swimmer, and supporter of the Surfers Against Sewage charity, I believe it is extremely important for information on this issue to be publicly available. In the critical mission of reducing sewage pollution, transparency and access to publicly available data are paramount for water companies like Thames Water. By openly sharing information about their wastewater management practices, discharge levels, and environmental impact, these companies foster accountability and public awareness within the communities they serve and enable collaborative efforts to address sewage pollution effectively.
How to use
I hope you all find this data as interesting to process and explore as I have! I will upload an EDA and modelling notebook soon.
I strongly recommend looking into the Surfers against Sewage charity if you are interested in supporting the cause to reduce sewage pollution and increase awareness!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The objective of the max-cut problem is to partition the vertices of a graph into two subsets such that the total weight of the edges crossing between the two subsets is maximized. Although it is an elementary graph partitioning problem, it is one of the most challenging combinatorial optimization problems, and its many application areas make it highly relevant. For this reason, the problem is solved here using the Harris Hawks Optimization algorithm (HHO). Though HHO has effectively solved some engineering optimization problems, it is sensitive to parameter settings and may converge slowly, potentially getting trapped in local optima. Thus, HHO together with some additional operators is used to solve the max-cut problem. Crossover and refinement operators are used to modify the fitness of the hawks in such a way that they can provide precise results. A mutation mechanism along with an adjustment operator improves the outcome obtained from the updated hawks. To accept a potential result, an acceptance criterion is used, and then a repair operator is applied in the proposed approach. The proposed system provided comparatively better outcomes on the G-set dataset than other state-of-the-art algorithms: it obtained 533 more cuts than the discrete cuckoo search algorithm on 9 instances, 1036 more cuts than PSO-EDA on 14 instances, and 1021 more cuts than TSHEA on 9 instances. For four instances, however, the cuts are lower than PSO-EDA and TSHEA. Statistical significance has also been tested using the Wilcoxon signed-rank test to support the superior performance of the proposed method. In terms of solution quality, MC-HHO can produce outcomes that are quite competitive when compared to other related state-of-the-art algorithms.
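A minimal sketch of the max-cut objective itself, i.e. the fitness a solver such as MC-HHO maximizes, evaluated for one candidate partition of a toy graph; it is not an implementation of the HHO-based approach described above.

```python
# Hedged sketch: the max-cut objective (total weight of edges whose
# endpoints fall in different subsets), evaluated for a candidate partition.
def cut_weight(edges, partition):
    """edges: list of (u, v, weight); partition: dict vertex -> 0 or 1."""
    return sum(w for u, v, w in edges if partition[u] != partition[v])

# Toy weighted graph and one candidate partition of its vertices.
edges = [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 3.0), (0, 3, 1.5), (1, 3, 2.5)]
partition = {0: 0, 1: 1, 2: 0, 3: 1}
print(cut_weight(edges, partition))  # 2.0 + 1.0 + 3.0 + 1.5 = 7.5
```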
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Flipkart India Phone Listings as of May 2024
Data scraped using Data Miner for the following details:
- Device Name
- Star Rating
- Storage
- Display
- Price
Data is intentionally left uncleaned to be used for cleaning and EDA processes.
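A minimal sketch of typical cleaning steps for these scraped fields; the file name and column headers are assumptions based on the field list above and may differ from the actual export.

```python
# Hedged sketch: cleaning scraped listing fields before EDA.
# File name and column headers ("Device Name", "Price", "Star Rating")
# are assumptions mirroring the field list above.
import pandas as pd

df = pd.read_csv("flipkart_phones_may2024.csv")  # hypothetical filename

# Strip currency symbols / thousands separators, e.g. "₹12,999" -> 12999.0
df["Price"] = pd.to_numeric(
    df["Price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
df["Star Rating"] = pd.to_numeric(df["Star Rating"], errors="coerce")

df = df.drop_duplicates(subset="Device Name").dropna(subset=["Price"])
print(df.describe(include="all"))
```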