License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support arguments that influence the Systematic Literature Review (SLR) of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes, which you can find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing applied consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
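As a concrete illustration, such a normalization can be implemented as a simple lookup; the mapping below is an invented approximation for illustration, not the exact rules used in the study.

```python
# Hypothetical sketch of normalizing the free-text "metrics" feature into
# the canonical classes named above; the substring patterns are assumptions.
METRIC_MAP = {
    "mean reciprocal rank": "MRR",
    "mrr": "MRR",
    "auc": "ROC or AUC",
    "roc": "ROC or AUC",
    "bleu": "BLEU Score",
    "accuracy": "Accuracy",
    "precision": "Precision",
    "recall": "Recall",
    "f1": "F1 Measure",
}

def normalize_metric(raw: str) -> str:
    """Collapse a free-text metric name into one of the canonical classes,
    falling back to "Other Metrics" for unconventional metrics."""
    key = raw.strip().lower()
    for pattern, canonical in METRIC_MAP.items():
        if pattern in key:
            return canonical
    return "Other Metrics"

print(normalize_metric("CodeBLEU"))     # BLEU Score
print(normalize_metric("Perplexity"))   # Other Metrics
```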
Transformation. In this stage, we did not use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to be used when tuning the explainable models.
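The projection step can be sketched with scikit-learn; synthetic data stands in for the 35 extracted features here (a real pipeline would first one-hot encode the nominal features), so this is illustrative only.

```python
# Minimal sketch of the PCA step: project the paper-by-feature matrix
# onto 2 components for a 2-D visualization.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 35))   # stand-in for 128 papers x 35 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)      # coordinates for a 2-D scatter plot

print(X_2d.shape)                # (128, 2)
print(pca.explained_variance_ratio_.sum())  # variance retained by 2 components
```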
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships among the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles represent both Premises and Conclusions. An arrow connecting a Premise with a Conclusion indicates that, given the premise, the conclusion is associated with it. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement, divided by the number of occurrences of the premise.
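The two formulas above can be made concrete over a toy list of transactions (each a set of observed feature values per paper); the example values are invented for illustration.

```python
# Support and confidence for association rules, following the definitions above.
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(premise, conclusion, transactions):
    """support(premise | conclusion) / support(premise)."""
    return support(premise | conclusion, transactions) / support(premise, transactions)

transactions = [
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Reproducible"},
    {"Unsupervised Learning", "Reproducible"},
]
print(support({"Supervised Learning", "Irreproducible"}, transactions))       # 0.5
print(confidence({"Supervised Learning"}, {"Irreproducible"}, transactions))  # ~0.667
```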
The dataset includes images of Kecimen and Besni raisin varieties grown in Turkey, with a total of 900 raisin grains, 450 from each variety. These images were captured using a computer vision system (CVS) and underwent various stages of pre-processing. A total of 7 morphological features were extracted from these images and classified using three different artificial intelligence techniques.
Data Fields:
Çinar, İlkay, Koklu, Murat, and Tasdemir, Sakir. (2023). Raisin. UCI Machine Learning Repository. https://doi.org/10.24432/C5660T.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
The goal of this project is to develop a classification model that can predict whether a student will pass based on various academic and demographic factors. By analyzing the provided dataset, we aim to identify key predictors of academic success and build a model that can help educational institutions improve student outcomes.
The dataset consists of 51,012 rows and includes the following columns:
A trained classification model that can predict whether a student will pass based on the provided features. A detailed report outlining the steps taken in the analysis, the performance of different models, and the final model's evaluation metrics. Visualizations to illustrate the relationships between different features and the target variable. Recommendations for educational institutions based on the findings of the analysis.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Attention deficit hyperactivity disorder (ADHD) is a highly prevalent neurodevelopmental disorder characterized by inattention, impulsivity, and hyperactivity, frequently co-occurring with other psychiatric and medical conditions. Current diagnosis is time-consuming and often delays effective treatment; to date, no valid biomarker has been identified to facilitate this process. Research has linked the core symptoms of ADHD to autonomic dysfunction resulting from impaired arousal modulation, which contributes to physiological abnormalities that may serve as useful biomarkers for the disorder. While recent research has explored alternative objective assessment tools, few have specifically focused on studying ADHD autonomic dysregulation through physiological parameters. This study aimed to design a multiparametric physiological model to support ADHD diagnosis.
Methods: In this observational study we non-invasively analyzed heart rate variability (HRV), electrodermal activity (EDA), respiration, and skin temperature parameters of 69 treatment-naïve ADHD children and 29 typically developing (TD) controls (7-12 years old). To identify the most relevant parameters for discriminating ADHD children from controls, we explored the physiological behavior at baseline and during a sustained attention task and applied a logistic regression procedure.
Results: ADHD children showed increased HRV and lower EDA at baseline. The stress-inducing task elicited higher reactivity for EDA, pulse arrival time (PAT), and respiratory frequency in the ADHD group. The final classification model included 4 physiological parameters and was adjusted by gender and age. A good capacity to discriminate between ADHD children and TD controls was obtained, with an accuracy rate of 85.5% and an AUC of 0.95.
Discussion: Our findings suggest that a multiparametric physiological model constitutes an accurate tool that can be easily employed to support ADHD diagnosis in clinical practice.
The discrimination capacity of the model may be analyzed in larger samples to confirm the possibility of generalization.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants was developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of images of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visuals like scatter plots and histograms, and inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.
Methodology:
Machine Learning Algorithms: Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split: The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
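The split-and-validate step just described can be sketched with scikit-learn; synthetic feature vectors and labels stand in for the cucumber image data, and the simple classifier is a placeholder, not the project's model.

```python
# 80/20 stratified train-test split plus 5-fold cross-validation
# on the training portion, using toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))              # stand-in feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in healthy/diseased labels

# Stratify so both classes appear in both parts of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print(len(X_train), len(X_test))           # 160 40
```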
Model Development: The CNN model's architecture consists of convolutional layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods like dropout and L2 regularization were used.
Model Training: During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
Model Evaluation:
Evaluation Metrics: Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both training and test datasets.
Performance Discussion: The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
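The evaluation metrics listed above can be computed with scikit-learn; the toy labels and predictions below are invented, not the project's results.

```python
# Accuracy, precision, recall, F1-score, and confusion matrix on toy predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted class

print(acc)           # 0.75
print(cm.tolist())   # [[3, 1], [1, 3]]
```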
Results and Discussion: Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project along with the methods used to solve them.
Conclusion: A recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted. Future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib
Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
The dataset has one training dataset, one testing (unseen) dataset, which is unlabeled, and a clickstream dataset, all interconnected through a common identifier known as "SESSION_ID." This identifier allows us to link user actions across the datasets. A session involves client online banking activities like signing in, updating passwords, viewing products, or adding items to the cart.
The majority of fraud cases add a new shipping address or change the password. You can do visualization to get more insights into the nature of the frauds.
I also added 2 datasets named "train/test_dataset_combined" which are the merged version of the train and test datasets based on the "SESSION_ID" column. For more information, please refer to this link: https://www.kaggle.com/code/mohammadbolandraftar/combine-datasets-in-pandas
In addition, I added the cleaned dataset after doing EDA. For more information about the EDA process, please refer to this link: https://www.kaggle.com/code/mohammadbolandraftar/a-deep-dive-into-fraud-detection-through-eda
License: CC0 1.0 Public Domain, https://creativecommons.org/publicdomain/zero/1.0/
I would like to express my gratitude to @markwijkhuizen and @dschettler8845 for sharing their notebook and dataset on Kaggle.
https://www.kaggle.com/datasets/markwijkhuizen/gislr-dataset-public/versions/1
https://www.kaggle.com/datasets/dschettler8845/gislr-extended-train-dataframe
https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
https://www.kaggle.com/code/dschettler8845/gislr-learn-eda-baseline
https://www.kaggle.com/mustafakeser4/mixed-data-gen
This dataset's code appears to be preparing data for creating mixed data from two different samples and saving them as numpy arrays.
It starts by grouping a dataframe train_df by two columns, total_frames and sign. Then it applies a lambda function to retrieve the index of each group, and filters the indices to only include groups with more than one index.
The code then loops over each filtered group, and for each group it loops over the indices and creates a mixed array of two samples with different frames. It saves each mixed array as a numpy file in a directory.
Finally, the code loads the numpy files from the directory and creates three numpy arrays X_ms, labels, and noneidxs. It saves these arrays as numpy files in a different directory. These arrays represent the mixed data, their labels, and the indices of non-empty frames in the mixed data.
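The grouping-and-mixing steps described above might look roughly like this sketch; the column names follow the description, but the 50/50 average, the toy frames, and the file naming are assumptions, not the notebook's actual code.

```python
# Hypothetical sketch: group train_df by (total_frames, sign), keep groups
# with more than one sample, mix each pair, and save as .npy files.
import os
import tempfile

import numpy as np
import pandas as pd

train_df = pd.DataFrame({
    "total_frames": [10, 10, 12, 10],
    "sign": ["hello", "hello", "thanks", "bye"],
})
frames = [np.full((f, 3), i, dtype=float)        # stand-in landmark arrays
          for i, f in enumerate(train_df["total_frames"])]

# Indices of each group, filtered to groups with more than one member.
idx_lists = [list(ix)
             for ix in train_df.groupby(["total_frames", "sign"]).groups.values()
             if len(ix) > 1]

out_dir = tempfile.mkdtemp()
for idxs in idx_lists:
    i, j = idxs[0], idxs[1]
    mixed = 0.5 * frames[i] + 0.5 * frames[j]    # same shape, so mixing is safe
    np.save(os.path.join(out_dir, f"mix_{i}_{j}.npy"), mixed)

saved = sorted(os.listdir(out_dir))
print(saved)   # ['mix_0_1.npy']
```

Grouping by both `total_frames` and `sign` guarantees the two mixed samples share a shape and a label, which is what makes the element-wise average well defined.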
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Subjects for whom exercise represents a low risk level, based on standardized guidelines from the American College of Sports Medicine (ACSM) [20], were asked to participate in the study. Eighteen healthy subjects, 11 males and 7 females, age 21 ± 3 years were enrolled. Participants were asked to avoid caffeine and alcohol during the 48 hours preceding the test, and were instructed to fast (water only) for at least 3 h before testing. The study was conducted in a quiet, comfortable room (ambient temperature, 18-20 °C, and relative humidity between 30-50%). All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants included in the study. This protocol was approved by the Institutional Review Board of the University of Connecticut.
Before the exercise test began, the subjects were asked to lie in the supine position for 5 min to procure hemodynamic stabilization prior to 5 minutes of data collection in this position. ECG and EDA were measured simultaneously for each subject throughout the entire experiment. The ECG signal was used to monitor subjects’ HR throughout the experiment. An HP ECG monitor (HP 78354A) and GSR ADInstruments module were used. Three hydrogel Ag-AgCl electrodes were used for ECG signal collection. The electrodes were placed on the shoulders and lower left rib. In addition, a pair of stainless steel electrodes was placed on the index and middle fingers of the right hand to collect the EDA signal. Subjects were instructed to keep their right hand stable, raised at chest height. The skin was cleaned with alcohol before placing the ECG and EDA electrodes. The leads were taped to the subject’s skin using latex-free tape, to avoid movement of the cables, which can corrupt the signals. All signals were acquired through the ADInstruments analog-to-digital converter, and compatible PowerLab software, while the sampling frequency was fixed to 400 Hz for all signals. Participants were asked to wear their own active wear/gym clothes during the protocol, with the shirt covering the electrodes and cables during the experiment. Subjects were first monitored for 5 min at rest (supine, without any movement or talking) to measure resting HR and EDA. The subjects then performed the incremental test on a motorized treadmill (Life Fitness F3). 85% of HRmax was calculated from the equation HRmax = 206.9 - (0.67 * age). The incremental running began with an initial warm-up, followed by walking at 3 mi/h (~ 4.82 km/h). The speed was increased to 5 mi/h (~ 8 km/h) and increased 0.6 mi/h (about 1 km/h) every subsequent minute until the subjects reached 85% of their HRmax.
When a subject reached 85% of HRmax within 2 min of running, the data were excluded because at least 2 minutes of data are required for processing. The 18 subjects enrolled for this study represents those who were able to provide at least 2 minutes of data prior to reaching 85% of HRmax. After subjects reached 85% of their HRmax, treadmill speed was reduced to 5 mi/h (~ 8 km/h) for another 4 min to start the recovery phase, followed by walking at 3 mi/h (about 4.82 km/h) for 5 minutes. A final 10 min period (or more if needed to achieve baseline HR) in the supine position was utilized to allow HR to return to baseline. The duration of the experiment was approximately one hour.
License: CC0 1.0 Public Domain, https://creativecommons.org/publicdomain/zero/1.0/
Content
This dataset contains job listings across many data science positions, including data scientist, machine learning engineer, data engineer, business analyst, data science manager, database administrator, business intelligence developer, and director of data science in the US. There are 1200 rows and 9 columns. The column headings are job title, company, location, rating, date, salary, description (summary), links, and descriptions (full). The data was web scraped from the Indeed web portal on Nov 20, 2022 using the Indeed API.
Potential tasks
Datasets like this could help sharpen your skills in data cleaning, EDA, feature engineering, classification, clustering, text processing, NLP etc. There are many NaN entries in the salary column as most job listings do not provide salary info, can you come up with a way to fill those entries? The last column (descriptions) contains the full job description, with this at your disposal, there is an infinite number of features you could extract such as skill requirement, education, experience, etc. Can these features be utilized in a skill clustering analysis to guide curriculum development? Can you deploy a classification model for salary prediction? What other insight can you glean from the data? Have fun playing with the dataset. Happy learning!
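One hedged sketch of the salary-imputation idea floated above: fill missing salaries with the median salary of listings that share a job title. The column names follow the listing above, but the toy rows and the group-median strategy are illustrative assumptions, not a recommended answer.

```python
# Group-wise median imputation for the NaN-heavy salary column.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "job title": ["data scientist", "data scientist", "data engineer",
                  "data engineer", "business analyst"],
    "salary": [120000.0, np.nan, 110000.0, np.nan, np.nan],
})

# Median per job title, falling back to the overall median for titles
# where every listing is missing a salary.
df["salary_filled"] = (df.groupby("job title")["salary"]
                         .transform(lambda s: s.fillna(s.median())))
df["salary_filled"] = df["salary_filled"].fillna(df["salary"].median())

print(df["salary_filled"].tolist())
# [120000.0, 120000.0, 110000.0, 110000.0, 115000.0]
```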
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was produced for an insurance company called 'Blue Insurance' and contains simulated customer reviews for various insurance products. It includes feedback from customers with positive, neutral, and negative experiences, along with suggested CRM actions for those reviews.
Big thanks to Google Gemini AI and Faker library for making this synthetic dataset generation possible.
The bank has multiple banking products that it sells to customers, such as savings accounts, credit cards, and investments. It wants to know which customers will purchase its credit cards. To that end, it has various kinds of information, such as the customers' demographic details and their banking behavior. Once it can predict the chance that a customer will purchase a product, it wants to use that prediction to target those customers.
In this part I will demonstrate how to build a model to predict which clients will subscribe to a term deposit, using machine learning. In the first part we will deal with the description and visualization of the analysed data, and in the second we will move on to data classification models.
- Desired target
- Data understanding
- Data preprocessing
- Machine learning model
- Prediction
- Comparing results
Predict if a client will subscribe (yes/no) to a term deposit — this is defined as a classification problem.
The dataset (Assignment-2_data.csv) used in this assignment contains bank customers' data. File name: Assignment-2_Data. File format: .csv. Number of rows: 45212. Number of attributes: 17 non-empty conditional attributes and one decision attribute.
![](https://user-images.githubusercontent.com/91852182/143783430-eafd25b0-6d40-40b8-ac5b-1c4f67ca9e02.png)
![](https://user-images.githubusercontent.com/91852182/143783451-3e49b817-29a6-4108-b597-ce35897dda4a.png)
Data pre-processing is a main step in machine learning, as the useful information that can be derived from a data set directly affects the model quality. It is therefore extremely important to do at least the necessary preprocessing on our data before feeding it into our model.
In this assignment, we are going to utilize python to develop a predictive machine learning model. First, we will import some important and necessary libraries.
Below we can see that there are various numerical and categorical columns. The most important column here is y, which is the output variable (desired target): this will tell us if the client subscribed to a term deposit (binary: ‘yes’, ‘no’).
![](https://user-images.githubusercontent.com/91852182/143783456-78c22016-149b-4218-a4a5-765ca348f069.png)
We must check whether our dataset has any missing values, and whether it has any duplicated values.
![](https://user-images.githubusercontent.com/91852182/143783471-a8656640-ec57-4f38-8905-35ef6f3e7f30.png)
We can see that 'age' has 9 missing values and 'balance' is missing 3 values as well. In this case, given that our dataset has around 45k rows, I will remove them from the dataset. Pics 1 and 2 show before and after.
![](https://user-images.githubusercontent.com/91852182/143783474-b3898011-98e3-43c8-bd06-2cfcde714694.png)
From the above analysis we can see that only 5289 people out of 45200 have subscribed, which is roughly 12%. Our dataset is highly imbalanced; we need to take note of this.
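A quick sketch of quantifying this imbalance, on a toy target column with the same roughly 12% positive rate; in practice, one might also pass `class_weight="balanced"` to scikit-learn classifiers or resample the minority class.

```python
# Check class balance of the target with value_counts.
import pandas as pd

y = pd.Series(["yes"] * 12 + ["no"] * 88)   # toy stand-in for the y column

counts = y.value_counts()
ratio = counts["yes"] / len(y)
print(counts.to_dict(), ratio)   # {'no': 88, 'yes': 12} 0.12
```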
![](https://user-images.githubusercontent.com/91852182/143783534-a05020a8-611d-4da1-98cf-4fec811cb5d8.png)
Our list of categorical variables.
![](https://user-images.githubusercontent.com/91852182/143783542-d40006cd-4086-4707-a683-f654a8cb2205.png)
Our list of numerical variables.
![](https://user-images.githubusercontent.com/91852182/143783551-6b220f99-2c4d-47d0-90ab-18ede42a4ae5.png)
In the boxplot above we can see some points at a very young age, as well as impossible ages. So:
![](https://user-images.githubusercontent.com/91852182/143783564-ad0e2a27-5df5-4e04-b5d7-6d218cabd405.png)
![](https://user-images.githubusercontent.com/91852182/143783589-5abf0a0b-8bab-4192-98c8-d2e04f32a5c5.png)
Now we don’t have issues with this feature, so we can use it.
![](https://user-images.githubusercontent.com/91852182/143783599-5205eddb-a0f5-446d-9f45-cc1adbfcce67.png)
![](https://user-images.githubusercontent.com/91852182/143783601-e520d59c-3b21-4627-a9bb-cac06f415a1e.png)
![](https://user-images.githubusercontent.com/91852182/143783634-03e5a584-a6fb-4bcb-8dc5-1f3cc50f9507.png)
![](https://user-images.githubusercontent.com/91852182/143783640-f6e71323-abbe-49c1-9935-35ffb2d10569.png)
This attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet the duration is not known before a call is performed. Also, after the end of the call, y is obviously known. Thus, this input should only be included for benchmark purposes...
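Because the duration is only known after the call, keeping it in a realistic predictive model would be data leakage; a minimal pandas sketch (toy frame, assumed column names) of dropping it before training:

```python
# Drop the leaky "duration" column (and the target) from the feature matrix.
import pandas as pd

df = pd.DataFrame({"age": [30, 45], "duration": [120, 0], "y": ["yes", "no"]})
X = df.drop(columns=["duration", "y"])   # keep duration only for benchmark runs
print(list(X.columns))                   # ['age']
```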
A dataset for beginners starting the data science process. The data is simple clinical data for problem definition and solving. A range of data science tasks, such as classification, clustering, EDA, and statistical analysis, can be performed with this dataset.
The columns present in the data set are:
- Age: Numerical (age of the patient)
- Sex: Binary (gender of the patient)
- BP: Nominal (blood pressure of the patient, with values: Low, Normal, and High)
- Cholesterol: Nominal (cholesterol of the patient, with values: Normal and High)
- Na: Numerical (sodium level of the patient)
- K: Numerical (potassium level of the patient)
- Drug: Nominal (type of drug prescribed by the doctor, with values: A, B, C, X, and Y)
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset includes 521 real-world job descriptions for various data analyst roles, compiled solely for educational and research purposes. It was created to support natural language processing (NLP) and skill extraction tasks.
Each row represents a unique job posting with:
- Job Title: The role being advertised
- Description: The full-text job description
🔍 Use Case:
This dataset was used in the "Job Skill Analyzer" project, which applies NLP and multi-label classification to extract in-demand skills such as Python, SQL, Tableau, Power BI, Excel, and Communication.
🎯 Ideal For: - NLP-based skill extraction - Resume/job description matching - EDA on job market skill trends - Multi-label text classification projects
⚠️ Disclaimer:
- The job descriptions were collected from publicly available postings across multiple job boards.
- No logos, branding, or personally identifiable information is included.
- This dataset is not intended for commercial use.
License: CC BY-NC-SA 4.0
Suitable For: NLP, EDA, Job Market Analysis, Skill Mining, Text Classification
License: CC0 1.0 Public Domain, https://creativecommons.org/publicdomain/zero/1.0/
A fictional dataset for exploratory data analysis (EDA) and to test simple prediction models.
This toy dataset features 150000 rows and 6 columns.
Note: All data is fictional. The data has been generated so that their distributions are convenient for statistical analysis.
Number: A simple index number for each row
City: The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)
Gender: Gender of a person (Male or Female)
Age: The age of a person (Ranging from 25 to 65 years)
Income: Annual income of a person (Ranging from -674 to 177175)
Illness: Is the person Ill? (Yes or No)
Stock photo by Mika Baumeister on Unsplash.
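A typical first pass over a table like this one: summary statistics plus a per-city illness rate. Since the dataset is fictional anyway, the sketch below generates a small stand-in sample with the same six columns; the distributions are assumptions, not the dataset's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000  # small stand-in for the 150,000 fictional rows

cities = ["Dallas", "New York City", "Los Angeles", "Mountain View",
          "Boston", "Washington D.C.", "San Diego", "Austin"]
df = pd.DataFrame({
    "Number": np.arange(n),
    "City": rng.choice(cities, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(25, 66, size=n),          # 25..65 inclusive
    "Income": rng.normal(91000, 25000, size=n).round(),
    "Illness": rng.choice(["Yes", "No"], size=n, p=[0.08, 0.92]),
})

# Typical first EDA steps: summary statistics and a per-city illness rate.
print(df[["Age", "Income"]].describe())
illness_rate = (df["Illness"] == "Yes").groupby(df["City"]).mean()
print(illness_rate.sort_values(ascending=False))
```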
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Preventive Maintenance for Marine Engines: Data-Driven Insights
Introduction:
Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.
Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.
Key steps include:
1. Data Simulation: Creating a realistic dataset with engine performance metrics.
2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
Tools Used
1. Python: Data processing, analysis, and modeling
2. Pandas & NumPy: Data manipulation
3. Scikit-Learn & XGBoost: Machine learning model training
4. Matplotlib & Seaborn: Data visualization
Skills Demonstrated
✔ Data Simulation & Preprocessing
✔ Exploratory Data Analysis (EDA)
✔ Feature Engineering & Encoding
✔ Supervised Machine Learning (Classification)
✔ Model Evaluation & Hyperparameter Tuning
Key Insights & Findings
📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
📌 Maintenance Status Distribution: The balanced dataset ensures unbiased model training.
📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.
Challenges Faced
🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
🚧 Model Performance: Accuracy was limited (~35%) due to the complexity of failure prediction.
🚧 Feature Selection: Identifying the most impactful features required extensive analysis.
Call to Action
🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.
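The simulation-then-tuning workflow described above can be sketched roughly as follows. The feature names, label rule, and parameter grid are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n = 600

# Hypothetical simulated sensor readings; the project's actual feature
# names and distributions may differ.
X = pd.DataFrame({
    "engine_temp": rng.normal(75, 10, n),
    "vibration_level": rng.normal(3.0, 1.0, n),
    "oil_pressure": rng.normal(4.5, 0.8, n),
    "rpm": rng.normal(1800, 200, n),
})
# Toy labeling rule so the data is learnable: hot, vibrating engines degrade.
risk = 0.6 * (X["engine_temp"] - 75) / 10 + 0.4 * (X["vibration_level"] - 3.0)
y = pd.cut(risk, bins=[-np.inf, -0.3, 0.5, np.inf],
           labels=["Normal", "Requires Maintenance", "Critical"]).astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Grid search over a small Random Forest hyperparameter grid.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100],
                                "max_depth": [3, 5, None]},
                    cv=3)
grid.fit(X_tr, y_tr)
print("best params:", grid.best_params_)
print("test accuracy:", round(grid.score(X_te, y_te), 3))
```

The same skeleton accommodates Decision Tree or XGBoost estimators by swapping the model passed to GridSearchCV.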
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
"Telecom Customer Churn Analysis and Prediction Dataset"
This dataset contains information on customers from a telecommunications company, designed to help identify the key factors that influence customer churn. Churn in the telecom industry refers to customers discontinuing their service, which has significant financial implications for service providers. Understanding why customers leave can help companies improve customer retention strategies, reduce churn rates, and enhance overall customer satisfaction.
Context & Source
The dataset provides real-world insights into telecom customer behavior, covering demographic, account, and usage information. This includes attributes like customer demographics, contract type, payment method, tenure, usage patterns, and whether the customer churned. Each record represents an individual customer, with labeled data indicating whether the customer is active or has churned.
This data is inspired by real-world telecom challenges and was created to support machine learning tasks such as classification, clustering, and exploratory data analysis (EDA). It’s particularly valuable for data scientists interested in predictive modeling for churn, as well as for business analysts working on customer retention strategies.
Potential Uses and Inspiration
This dataset can be used for:
- Building predictive models to classify customers as churned or active
- Analyzing which factors contribute most to churn
- Designing interventions for at-risk customers
- Practicing data preprocessing, feature engineering, and visualization skills

Whether you’re a beginner in machine learning or an experienced data scientist, this dataset offers opportunities to explore the complexities of customer behavior in the telecom industry and to develop strategies that can help reduce customer churn.
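A common first cut when hunting for churn drivers is the churn rate per contract type and the tenure gap between churners and non-churners. The mini-sample and column names below are illustrative assumptions; the actual dataset's columns may differ.

```python
import pandas as pd

# Hypothetical mini-sample; the real dataset's column names may differ.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "One year",
                 "Month-to-month", "Two year", "One year", "Month-to-month"],
    "tenure":   [2, 5, 60, 24, 1, 45, 30, 3],
    "Churn":    ["Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes"],
})

# Churn rate by contract type.
churn_rate = (df["Churn"] == "Yes").groupby(df["Contract"]).mean()
print(churn_rate)

# Do churners have shorter tenure on average?
print(df.groupby("Churn")["tenure"].mean())
```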
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains detailed records of customer interactions handled by a customer service team through various communication channels such as inbound calls, outbound calls, and digital touchpoints. It includes over 85,000 entries with information related to the nature of the issue, product categories, agent details, and customer satisfaction scores (CSAT).
Key features include:
Issue Metadata: Timestamps for when the issue was reported and responded to.
Categorization: High-level and sub-level issue categories for better analysis.
Agent Information: Names, supervisors, managers, shift, and tenure bucket.
Customer Feedback: CSAT scores and free-text customer remarks.
Transactional Data: Order IDs, product categories, item prices, and customer city.
This dataset is ideal for exploratory data analysis (EDA), natural language processing (NLP), time-to-resolution analysis, customer satisfaction prediction, and performance benchmarking of service agents.
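For the time-to-resolution angle, the sketch below derives a response time from the issue timestamps and checks its relationship with CSAT. The sample records and column names are illustrative assumptions, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical interaction records; actual column names in the dataset may differ.
df = pd.DataFrame({
    "issue_reported_at":  ["2024-05-01 09:00", "2024-05-01 10:15", "2024-05-02 14:30"],
    "issue_responded_at": ["2024-05-01 09:45", "2024-05-01 12:15", "2024-05-02 14:40"],
    "channel":            ["Inbound", "Outbound", "Digital"],
    "csat_score":         [4, 2, 5],
})

for col in ("issue_reported_at", "issue_responded_at"):
    df[col] = pd.to_datetime(df[col])

# Time-to-response in minutes, then its relationship with CSAT.
df["response_minutes"] = (df["issue_responded_at"]
                          - df["issue_reported_at"]).dt.total_seconds() / 60
print(df[["channel", "response_minutes", "csat_score"]])
print("correlation:", df["response_minutes"].corr(df["csat_score"]))
```

In this toy sample, slower responses line up with lower CSAT, which is the kind of pattern the full dataset lets you test at scale.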
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Data Set Description This dataset simulates a retail environment with a million rows and 100+ columns, covering customer information, transactional data, product details, promotional information, and customer behavior metrics. It includes data for predicting total sales (regression) and customer churn (classification).
Detailed Column Descriptions

Customer Information:
- customer_id: Unique identifier for each customer.
- age: Age of the customer.
- gender: Gender of the customer (e.g., Male, Female, Other).
- income_bracket: Income bracket of the customer (e.g., Low, Medium, High).
- loyalty_program: Whether the customer is part of a loyalty program (Yes/No).
- membership_years: Number of years the customer has been a member.
- churned: Whether the customer has churned (Yes/No) - target for classification.
- marital_status: Marital status of the customer.
- number_of_children: Number of children the customer has.
- education_level: Education level of the customer (e.g., High School, Bachelor's, Master's).
- occupation: Occupation of the customer.

Transactional Data:
- transaction_id: Unique identifier for each transaction.
- transaction_date: Date of the transaction.
- product_id: Unique identifier for each product.
- product_category: Category of the product (e.g., Electronics, Clothing, Groceries).
- quantity: Quantity of the product purchased.
- unit_price: Price per unit of the product.
- discount_applied: Discount applied on the transaction.
- payment_method: Payment method used (e.g., Credit Card, Debit Card, Cash).
- store_location: Location of the store where the purchase was made.

Customer Behavior Metrics:
- avg_purchase_value: Average value of purchases made by the customer.
- purchase_frequency: Frequency of purchases (e.g., Daily, Weekly, Monthly, Yearly).
- last_purchase_date: Date of the last purchase made by the customer.
- avg_discount_used: Average discount percentage used by the customer.
- preferred_store: Store location most frequently visited by the customer.
- online_purchases: Number of online purchases made by the customer.
- in_store_purchases: Number of in-store purchases made by the customer.
- avg_items_per_transaction: Average number of items per transaction.
- avg_transaction_value: Average value per transaction.
- total_returned_items: Total number of items returned by the customer.
- total_returned_value: Total value of returned items.

Sales Data:
- total_sales: Total sales amount for each customer over the last year - target for regression.
- total_transactions: Total number of transactions made by each customer.
- total_items_purchased: Total number of items purchased by each customer.
- total_discounts_received: Total discounts received by each customer.
- avg_spent_per_category: Average amount spent per product category.
- max_single_purchase_value: Maximum value of a single purchase.
- min_single_purchase_value: Minimum value of a single purchase.

Product Information:
- product_name: Name of the product.
- product_brand: Brand of the product.
- product_rating: Customer rating of the product.
- product_review_count: Number of reviews for the product.
- product_stock: Stock availability of the product.
- product_return_rate: Rate at which the product is returned.
- product_size: Size of the product (if applicable).
- product_weight: Weight of the product (if applicable).
- product_color: Color of the product (if applicable).
- product_material: Material of the product (if applicable).
- product_manufacture_date: Manufacture date of the product.
- product_expiry_date: Expiry date of the product (if applicable).
- product_shelf_life: Shelf life of the product (if applicable).

Promotional Data:
- promotion_id: Unique identifier for each promotion.
- promotion_type: Type of promotion (e.g., Buy One Get One Free, 20% Off).
- promotion_start_date: Start date of the promotion.
- promotion_end_date: End date of the promotion.
- promotion_effectiveness: Effectiveness of the promotion (e.g., High, Medium, Low).
- promotion_channel: Channel through which the promotion was advertised (e.g., Online, In-store, Social Media).
- promotion_target_audience: Target audience for the promotion (e.g., New Customers, Returning Customers).

Geographical Data:
- customer_zip_code: Zip code of the customer's residence.
- customer_city: City of the customer's residence.
- customer_state: State of the customer's residence.
- store_zip_code: Zip code of the store.
- store_city: City where the store is located.
- store_state: State where the store is located.
- distance_to_store: Distance from the customer's residence to the store.

Seasonal and Temporal Data:
- holiday_season: Whether the transaction occurred during a holiday season (Yes/No).
- season: Season of the year (e.g., Winter, Spring, Summer, Fall).
- weekend: Whether the transaction occurred on a weekend (Yes/No).

Customer Interaction Data:
- customer_support_calls: Number of calls made to customer support.
- email_subscription...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures for describing data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support the arguments made in the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. These hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database. This database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was guided by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, found in the repository. We manually engineered these features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing consisted of casting the features to the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to fill in information missed during the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models.
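The transformation step above can be sketched as follows. The analysis itself was conducted in RapidMiner; this is a hedged Python approximation on a random stand-in matrix (128 papers x 35 one-hot features), using KMeans inertia as the variance-reduction criterion for choosing the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Stand-in for the one-hot encoded paper/feature matrix (papers x 35 nominal features).
X = rng.integers(0, 2, size=(128, 35)).astype(float)

# Reduce the 35 features to 2 components for visualization.
coords = PCA(n_components=2, random_state=7).fit_transform(X)
print(coords.shape)  # (128, 2)

# Scan candidate cluster counts; the "elbow" in within-cluster variance
# (inertia) suggests a value of k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=7).fit(X).inertia_
            for k in range(2, 8)}
print(inertias)
```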
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be to uncover hidden relationships among the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state of the art (Clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produced an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles represent both premises and conclusions. An arrow connecting a premise with a conclusion indicates that, given the premise, the conclusion holds with a certain support and confidence. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain support and confidence.
Support = the fraction of papers in which the rule holds, i.e., the number of papers containing both the premise and the conclusion divided by the total number of papers.
Confidence = the fraction of papers containing the premise in which the conclusion also holds, i.e., the number of papers containing both divided by the number of papers containing the premise.
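These two definitions can be made concrete with a few lines of Python. The five paper records below are hypothetical, chosen only to echo the Supervised Learning / irreproducibility example above.

```python
# Each record lists properties extracted from one (hypothetical) DL4SE paper.
papers = [
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Reproducible"},
    {"Unsupervised Learning", "Irreproducible"},
    {"Supervised Learning", "Irreproducible"},
]

def support(premise, conclusion, records):
    """Fraction of records in which both the premise and the conclusion hold."""
    both = sum(1 for r in records if premise in r and conclusion in r)
    return both / len(records)

def confidence(premise, conclusion, records):
    """Among records containing the premise, the fraction that also contain the conclusion."""
    prem = sum(1 for r in records if premise in r)
    both = sum(1 for r in records if premise in r and conclusion in r)
    return both / prem

print(support("Supervised Learning", "Irreproducible", papers))     # 3/5 = 0.6
print(confidence("Supervised Learning", "Irreproducible", papers))  # 3/4 = 0.75
```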