License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support arguments that influence the Systematic Literature Review (SLR) of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes, which you can find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing applied consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
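As a concrete illustration, such a normalization can be implemented as a simple lookup; the mapping below is an invented approximation for illustration, not the exact rules used in the study.

```python
# Hypothetical sketch of normalizing the free-text "metrics" feature into
# the canonical classes named above; the substring patterns are assumptions.
METRIC_MAP = {
    "mean reciprocal rank": "MRR",
    "mrr": "MRR",
    "auc": "ROC or AUC",
    "roc": "ROC or AUC",
    "bleu": "BLEU Score",
    "accuracy": "Accuracy",
    "precision": "Precision",
    "recall": "Recall",
    "f1": "F1 Measure",
}

def normalize_metric(raw: str) -> str:
    """Collapse a free-text metric name into one of the canonical classes,
    falling back to "Other Metrics" for unconventional metrics."""
    key = raw.strip().lower()
    for pattern, canonical in METRIC_MAP.items():
        if pattern in key:
            return canonical
    return "Other Metrics"

print(normalize_metric("CodeBLEU"))     # BLEU Score
print(normalize_metric("Perplexity"))   # Other Metrics
```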
Transformation. In this stage, we did not use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to be used when tuning the explainable models.
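The projection step can be sketched with scikit-learn; synthetic data stands in for the 35 extracted features here (a real pipeline would first one-hot encode the nominal features), so this is illustrative only.

```python
# Minimal sketch of the PCA step: project the paper-by-feature matrix
# onto 2 components for a 2-D visualization.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 35))   # stand-in for 128 papers x 35 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)      # coordinates for a 2-D scatter plot

print(X_2d.shape)                # (128, 2)
print(pca.explained_variance_ratio_.sum())  # variance retained by 2 components
```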
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships among the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles represent both Premises and Conclusions. An arrow connecting a Premise with a Conclusion indicates that, given the premise, the conclusion is associated with it. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement, divided by the number of occurrences of the premise.
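The two formulas above can be made concrete over a toy list of transactions (each a set of observed feature values per paper); the example values are invented for illustration.

```python
# Support and confidence for association rules, following the definitions above.
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(premise, conclusion, transactions):
    """support(premise | conclusion) / support(premise)."""
    return support(premise | conclusion, transactions) / support(premise, transactions)

transactions = [
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Reproducible"},
    {"Unsupervised Learning", "Reproducible"},
]
print(support({"Supervised Learning", "Irreproducible"}, transactions))       # 0.5
print(confidence({"Supervised Learning"}, {"Irreproducible"}, transactions))  # ~0.667
```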
The dataset includes images of Kecimen and Besni raisin varieties grown in Turkey, with a total of 900 raisin grains, 450 from each variety. These images were captured using a computer vision system (CVS) and underwent various stages of pre-processing. A total of 7 morphological features were extracted from these images and classified using three different artificial intelligence techniques.
Data Fields:
Çinar, İlkay, Koklu, Murat, and Tasdemir, Sakir. (2023). Raisin. UCI Machine Learning Repository. https://doi.org/10.24432/C5660T.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
The goal of this project is to develop a classification model that can predict whether a student will pass based on various academic and demographic factors. By analyzing the provided dataset, we aim to identify key predictors of academic success and build a model that can help educational institutions improve student outcomes.
The dataset consists of 51,012 rows and includes the following columns:
A trained classification model that can predict whether a student will pass based on the provided features. A detailed report outlining the steps taken in the analysis, the performance of different models, and the final model's evaluation metrics. Visualizations to illustrate the relationships between different features and the target variable. Recommendations for educational institutions based on the findings of the analysis.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Attention deficit hyperactivity disorder (ADHD) is a highly prevalent neurodevelopmental disorder characterized by inattention, impulsivity, and hyperactivity, frequently co-occurring with other psychiatric and medical conditions. Current diagnosis is time-consuming and often delays effective treatment; to date, no valid biomarker has been identified to facilitate this process. Research has linked the core symptoms of ADHD to autonomic dysfunction resulting from impaired arousal modulation, which contributes to physiological abnormalities that may serve as useful biomarkers for the disorder. While recent research has explored alternative objective assessment tools, few have specifically focused on studying ADHD autonomic dysregulation through physiological parameters. This study aimed to design a multiparametric physiological model to support ADHD diagnosis.
Methods: In this observational study we non-invasively analyzed heart rate variability (HRV), electrodermal activity (EDA), respiration, and skin temperature parameters of 69 treatment-naïve ADHD children and 29 typically developing (TD) controls (7-12 years old). To identify the most relevant parameters for discriminating ADHD children from controls, we explored the physiological behavior at baseline and during a sustained attention task and applied a logistic regression procedure.
Results: ADHD children showed increased HRV and lower EDA at baseline. The stress-inducing task elicited higher reactivity for EDA, pulse arrival time (PAT), and respiratory frequency in the ADHD group. The final classification model included 4 physiological parameters and was adjusted by gender and age. A good capacity to discriminate between ADHD children and TD controls was obtained, with an accuracy rate of 85.5% and an AUC of 0.95.
Discussion: Our findings suggest that a multiparametric physiological model constitutes an accurate tool that can be easily employed to support ADHD diagnosis in clinical practice.
The discrimination capacity of the model may be analyzed in larger samples to confirm the possibility of generalization.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants was developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of images of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visuals like scatter plots and histograms, and inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of images of healthy and diseased plants.
Methodology:
Machine Learning Algorithms: Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split: The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
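The split-and-validate step just described can be sketched with scikit-learn; synthetic feature vectors and labels stand in for the cucumber image data, and the simple classifier is a placeholder, not the project's model.

```python
# 80/20 stratified train-test split plus 5-fold cross-validation
# on the training portion, using toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))              # stand-in feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in healthy/diseased labels

# Stratify so both classes appear in both parts of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print(len(X_train), len(X_test))           # 160 40
```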
Model Development: The CNN model's architecture consists of convolutional layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods like dropout and L2 regularization were used.
Model Training: During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
Model Evaluation:
Evaluation Metrics: Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both training and test datasets.
Performance Discussion: The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
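The evaluation metrics listed above can be computed with scikit-learn; the toy labels and predictions below are invented, not the project's results.

```python
# Accuracy, precision, recall, F1-score, and confusion matrix on toy predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted class

print(acc)           # 0.75
print(cm.tolist())   # [[3, 1], [1, 3]]
```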
Results and Discussion: Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project along with the methods used to solve them.
Conclusion: A recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted. Future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, Sklearn, matplotlib
Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
The dataset has one training dataset, one testing (unseen) dataset, which is unlabeled, and a clickstream dataset, all interconnected through a common identifier known as "SESSION_ID." This identifier allows us to link user actions across the datasets. A session involves client online banking activities like signing in, updating passwords, viewing products, or adding items to the cart.
The majority of fraud cases add a new shipping address or change the password. You can do visualization to get more insights into the nature of the frauds.
I also added 2 datasets named "train/test_dataset_combined" which are the merged version of the train and test datasets based on the "SESSION_ID" column. For more information, please refer to this link: https://www.kaggle.com/code/mohammadbolandraftar/combine-datasets-in-pandas
In addition, I added the cleaned dataset after doing EDA. For more information about the EDA process, please refer to this link: https://www.kaggle.com/code/mohammadbolandraftar/a-deep-dive-into-fraud-detection-through-eda
License: CC0 1.0 Public Domain, https://creativecommons.org/publicdomain/zero/1.0/
I would like to express my gratitude to @markwijkhuizen and @dschettler8845 for sharing their notebook and dataset on Kaggle.
https://www.kaggle.com/datasets/markwijkhuizen/gislr-dataset-public/versions/1
https://www.kaggle.com/datasets/dschettler8845/gislr-extended-train-dataframe
https://www.kaggle.com/code/markwijkhuizen/gislr-tf-data-processing-transformer-training
https://www.kaggle.com/code/dschettler8845/gislr-learn-eda-baseline
https://www.kaggle.com/mustafakeser4/mixed-data-gen
This dataset's code appears to be preparing data for creating mixed data from two different samples and saving them as numpy arrays.
It starts by grouping a dataframe train_df by two columns, total_frames and sign. Then it applies a lambda function to retrieve the index of each group, and filters the indices to only include groups with more than one index.
The code then loops over each filtered group, and for each group it loops over the indices and creates a mixed array of two samples with different frames. It saves each mixed array as a numpy file in a directory.
Finally, the code loads the numpy files from the directory and creates three numpy arrays X_ms, labels, and noneidxs. It saves these arrays as numpy files in a different directory. These arrays represent the mixed data, their labels, and the indices of non-empty frames in the mixed data.
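The grouping-and-mixing steps described above might look roughly like this sketch; the column names follow the description, but the 50/50 average, the toy frames, and the file naming are assumptions, not the notebook's actual code.

```python
# Hypothetical sketch: group train_df by (total_frames, sign), keep groups
# with more than one sample, mix each pair, and save as .npy files.
import os
import tempfile

import numpy as np
import pandas as pd

train_df = pd.DataFrame({
    "total_frames": [10, 10, 12, 10],
    "sign": ["hello", "hello", "thanks", "bye"],
})
frames = [np.full((f, 3), i, dtype=float)        # stand-in landmark arrays
          for i, f in enumerate(train_df["total_frames"])]

# Indices of each group, filtered to groups with more than one member.
idx_lists = [list(ix)
             for ix in train_df.groupby(["total_frames", "sign"]).groups.values()
             if len(ix) > 1]

out_dir = tempfile.mkdtemp()
for idxs in idx_lists:
    i, j = idxs[0], idxs[1]
    mixed = 0.5 * frames[i] + 0.5 * frames[j]    # same shape, so mixing is safe
    np.save(os.path.join(out_dir, f"mix_{i}_{j}.npy"), mixed)

saved = sorted(os.listdir(out_dir))
print(saved)   # ['mix_0_1.npy']
```

Grouping by both `total_frames` and `sign` guarantees the two mixed samples share a shape and a label, which is what makes the element-wise average well defined.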
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Subjects for whom exercise represents a low risk level, based on standardized guidelines from the American College of Sports Medicine (ACSM) [20], were asked to participate in the study. Eighteen healthy subjects, 11 males and 7 females, age 21 ± 3 years were enrolled. Participants were asked to avoid caffeine and alcohol during the 48 hours preceding the test, and were instructed to fast (water only) for at least 3 h before testing. The study was conducted in a quiet, comfortable room (ambient temperature, 18-20 °C, and relative humidity between 30-50%). All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants included in the study. This protocol was approved by the Institutional Review Board of the University of Connecticut.
Before the exercise test began, the subjects were asked to lie in the supine position for 5 min to procure hemodynamic stabilization prior to 5 minutes of data collection in this position. ECG and EDA were measured simultaneously for each subject throughout the entire experiment. The ECG signal was used to monitor subjects’ HR throughout the experiment. An HP ECG monitor (HP 78354A) and GSR ADInstruments module were used. Three hydrogel Ag-AgCl electrodes were used for ECG signal collection. The electrodes were placed on the shoulders and lower left rib. In addition, a pair of stainless steel electrodes was placed on the index and middle fingers of the right hand to collect the EDA signal. Subjects were instructed to keep their right hand stable, raised at chest height. The skin was cleaned with alcohol before placing the ECG and EDA electrodes. The leads were taped to the subject’s skin using latex-free tape, to avoid movement of the cables, which can corrupt the signals. All signals were acquired through the ADInstruments analog-to-digital converter, and compatible PowerLab software, while the sampling frequency was fixed to 400 Hz for all signals. Participants were asked to wear their own active wear/gym clothes during the protocol, with the shirt covering the electrodes and cables during the experiment. Subjects were first monitored for 5 min at rest (supine, without any movement or talking) to measure resting HR and EDA. The subjects then performed the incremental test on a motorized treadmill (Life Fitness F3). 85% of HRmax was calculated from the equation HRmax = 206.9 - (0.67 * age). The incremental running began with an initial warm-up, followed by walking at 3 mi/h (~ 4.82 km/h). The speed was increased to 5 mi/h (~ 8 km/h) and increased 0.6 mi/h (about 1 km/h) every subsequent minute until the subjects reached 85% of their HRmax.
When a subject reached 85% of HRmax within 2 min of running, the data were excluded because at least 2 minutes of data are required for processing. The 18 subjects enrolled for this study represents those who were able to provide at least 2 minutes of data prior to reaching 85% of HRmax. After subjects reached 85% of their HRmax, treadmill speed was reduced to 5 mi/h (~ 8 km/h) for another 4 min to start the recovery phase, followed by walking at 3 mi/h (about 4.82 km/h) for 5 minutes. A final 10 min period (or more if needed to achieve baseline HR) in the supine position was utilized to allow HR to return to baseline. The duration of the experiment was approximately one hour.
License: CC0 1.0 Public Domain, https://creativecommons.org/publicdomain/zero/1.0/
Content
This dataset contains job listings across many data science positions, including data scientist, machine learning engineer, data engineer, business analyst, data science manager, database administrator, business intelligence developer, and director of data science in the US. There are 1200 rows and 9 columns. The column headings are job title, company, location, rating, date, salary, description (summary), links, and descriptions (full). The data was web scraped from the Indeed web portal on Nov 20, 2022 using the Indeed API.
Potential tasks
Datasets like this could help sharpen your skills in data cleaning, EDA, feature engineering, classification, clustering, text processing, NLP etc. There are many NaN entries in the salary column as most job listings do not provide salary info, can you come up with a way to fill those entries? The last column (descriptions) contains the full job description, with this at your disposal, there is an infinite number of features you could extract such as skill requirement, education, experience, etc. Can these features be utilized in a skill clustering analysis to guide curriculum development? Can you deploy a classification model for salary prediction? What other insight can you glean from the data? Have fun playing with the dataset. Happy learning!
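One hedged sketch of the salary-imputation idea floated above: fill missing salaries with the median salary of listings that share a job title. The column names follow the listing above, but the toy rows and the group-median strategy are illustrative assumptions, not a recommended answer.

```python
# Group-wise median imputation for the NaN-heavy salary column.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "job title": ["data scientist", "data scientist", "data engineer",
                  "data engineer", "business analyst"],
    "salary": [120000.0, np.nan, 110000.0, np.nan, np.nan],
})

# Median per job title, falling back to the overall median for titles
# where every listing is missing a salary.
df["salary_filled"] = (df.groupby("job title")["salary"]
                         .transform(lambda s: s.fillna(s.median())))
df["salary_filled"] = df["salary_filled"].fillna(df["salary"].median())

print(df["salary_filled"].tolist())
# [120000.0, 120000.0, 110000.0, 110000.0, 115000.0]
```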
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was produced for an insurance company called 'Blue Insurance' and contains simulated customer reviews for various insurance products. It includes feedback from customers with positive, neutral, and negative experiences, along with suggested CRM actions for those reviews.
Big thanks to Google Gemini AI and Faker library for making this synthetic dataset generation possible.
The bank has multiple banking products that it sells to customers, such as savings accounts, credit cards, and investments. It wants to know which customers will purchase its credit cards. To that end, it has various kinds of information, such as the customers' demographic details and their banking behavior. Once it can predict the chance that a customer will purchase a product, it wants to use that prediction to target those customers.
In this part I will demonstrate how to build a model to predict which clients will subscribe to a term deposit, using machine learning. In the first part we will deal with the description and visualization of the analysed data, and in the second we will move on to data classification models.
- Desired target
- Data understanding
- Data preprocessing
- Machine learning model
- Prediction
- Comparing results
Predict if a client will subscribe (yes/no) to a term deposit — this is defined as a classification problem.
The dataset (Assignment-2_data.csv) used in this assignment contains bank customers' data. File name: Assignment-2_Data. File format: .csv. Number of rows: 45212. Number of attributes: 17 non-empty conditional attributes and one decision attribute.
![](https://user-images.githubusercontent.com/91852182/143783430-eafd25b0-6d40-40b8-ac5b-1c4f67ca9e02.png)
![](https://user-images.githubusercontent.com/91852182/143783451-3e49b817-29a6-4108-b597-ce35897dda4a.png)
Data pre-processing is a main step in machine learning, as the useful information that can be derived from a data set directly affects the model quality. It is therefore extremely important to do at least the necessary preprocessing on our data before feeding it into our model.
In this assignment, we are going to utilize python to develop a predictive machine learning model. First, we will import some important and necessary libraries.
Below we can see that there are various numerical and categorical columns. The most important column here is y, which is the output variable (desired target): this will tell us if the client subscribed to a term deposit (binary: ‘yes’, ‘no’).
![](https://user-images.githubusercontent.com/91852182/143783456-78c22016-149b-4218-a4a5-765ca348f069.png)
We must check whether our dataset has any missing values, and whether it has any duplicated values.
![](https://user-images.githubusercontent.com/91852182/143783471-a8656640-ec57-4f38-8905-35ef6f3e7f30.png)
We can see that 'age' has 9 missing values and 'balance' is missing 3 values as well. In this case, given that our dataset has around 45k rows, I will remove them from the dataset. Pics 1 and 2 show before and after.
![](https://user-images.githubusercontent.com/91852182/143783474-b3898011-98e3-43c8-bd06-2cfcde714694.png)
From the above analysis we can see that only 5289 people out of 45200 have subscribed, which is roughly 12%. Our dataset is highly imbalanced; we need to take note of this.
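A quick sketch of quantifying this imbalance, on a toy target column with the same roughly 12% positive rate; in practice, one might also pass `class_weight="balanced"` to scikit-learn classifiers or resample the minority class.

```python
# Check class balance of the target with value_counts.
import pandas as pd

y = pd.Series(["yes"] * 12 + ["no"] * 88)   # toy stand-in for the y column

counts = y.value_counts()
ratio = counts["yes"] / len(y)
print(counts.to_dict(), ratio)   # {'no': 88, 'yes': 12} 0.12
```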
![](https://user-images.githubusercontent.com/91852182/143783534-a05020a8-611d-4da1-98cf-4fec811cb5d8.png)
Our list of categorical variables.
![](https://user-images.githubusercontent.com/91852182/143783542-d40006cd-4086-4707-a683-f654a8cb2205.png)
Our list of numerical variables.
![](https://user-images.githubusercontent.com/91852182/143783551-6b220f99-2c4d-47d0-90ab-18ede42a4ae5.png)
In the boxplot above we can see some points at a very young age, as well as impossible ages. So:
![](https://user-images.githubusercontent.com/91852182/143783564-ad0e2a27-5df5-4e04-b5d7-6d218cabd405.png)
![](https://user-images.githubusercontent.com/91852182/143783589-5abf0a0b-8bab-4192-98c8-d2e04f32a5c5.png)
Now we don’t have issues with this feature, so we can use it.
![](https://user-images.githubusercontent.com/91852182/143783599-5205eddb-a0f5-446d-9f45-cc1adbfcce67.png)
![](https://user-images.githubusercontent.com/91852182/143783601-e520d59c-3b21-4627-a9bb-cac06f415a1e.png)
![](https://user-images.githubusercontent.com/91852182/143783634-03e5a584-a6fb-4bcb-8dc5-1f3cc50f9507.png)
![](https://user-images.githubusercontent.com/91852182/143783640-f6e71323-abbe-49c1-9935-35ffb2d10569.png)
This attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet the duration is not known before a call is performed. Also, after the end of the call, y is obviously known. Thus, this input should only be included for benchmark purposes...
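Because the duration is only known after the call, keeping it in a realistic predictive model would be data leakage; a minimal pandas sketch (toy frame, assumed column names) of dropping it before training:

```python
# Drop the leaky "duration" column (and the target) from the feature matrix.
import pandas as pd

df = pd.DataFrame({"age": [30, 45], "duration": [120, 0], "y": ["yes", "no"]})
X = df.drop(columns=["duration", "y"])   # keep duration only for benchmark runs
print(list(X.columns))                   # ['age']
```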
A dataset for beginners starting the data science process. The data is simple clinical data for problem definition and solving. A range of data science tasks, such as classification, clustering, EDA, and statistical analysis, can be performed with this dataset.
The columns present in the data set are:
- Age: Numerical (age of the patient)
- Sex: Binary (gender of the patient)
- BP: Nominal (blood pressure of the patient, with values: Low, Normal, and High)
- Cholesterol: Nominal (cholesterol of the patient, with values: Normal and High)
- Na: Numerical (sodium level of the patient)
- K: Numerical (potassium level of the patient)
- Drug: Nominal (type of drug prescribed by the doctor, with values: A, B, C, X, and Y)
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset includes 521 real-world job descriptions for various data analyst roles, compiled solely for educational and research purposes. It was created to support natural language processing (NLP) and skill extraction tasks.
Each row represents a unique job posting with:
- Job Title: The role being advertised
- Description: The full-text job description
🔍 Use Case:
This dataset was used in the "Job Skill Analyzer" project, which applies NLP and multi-label classification to extract in-demand skills such as Python, SQL, Tableau, Power BI, Excel, and Communication.
🎯 Ideal For: - NLP-based skill extraction - Resume/job description matching - EDA on job market skill trends - Multi-label text classification projects
⚠️ Disclaimer:
- The job descriptions were collected from publicly available postings across multiple job boards.
- No logos, branding, or personally identifiable information is included.
- This dataset is not intended for commercial use.
License: CC BY-NC-SA 4.0
Suitable For: NLP, EDA, Job Market Analysis, Skill Mining, Text Classification
License: CC0 1.0 Public Domain, https://creativecommons.org/publicdomain/zero/1.0/
A fictional dataset for exploratory data analysis (EDA) and to test simple prediction models.
This toy dataset features 150000 rows and 6 columns.
Note: All data is fictional. The data has been generated so that their distributions are convenient for statistical analysis.
Number: A simple index number for each row
City: The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)
Gender: Gender of a person (Male or Female)
Age: The age of a person (Ranging from 25 to 65 years)
Income: Annual income of a person (Ranging from -674 to 177175)
Illness: Is the person Ill? (Yes or No)
Stock photo by Mika Baumeister on Unsplash.
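A typical first pass over a table like this one: summary statistics plus a per-city illness rate. Since the dataset is fictional anyway, the sketch below generates a small stand-in sample with the same six columns; the distributions are assumptions, not the dataset's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000  # small stand-in for the 150,000 fictional rows

cities = ["Dallas", "New York City", "Los Angeles", "Mountain View",
          "Boston", "Washington D.C.", "San Diego", "Austin"]
df = pd.DataFrame({
    "Number": np.arange(n),
    "City": rng.choice(cities, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(25, 66, size=n),          # 25..65 inclusive
    "Income": rng.normal(91000, 25000, size=n).round(),
    "Illness": rng.choice(["Yes", "No"], size=n, p=[0.08, 0.92]),
})

# Typical first EDA steps: summary statistics and a per-city illness rate.
print(df[["Age", "Income"]].describe())
illness_rate = (df["Illness"] == "Yes").groupby(df["City"]).mean()
print(illness_rate.sort_values(ascending=False))
```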
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Preventive Maintenance for Marine Engines: Data-Driven Insights
Introduction:
Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.
Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.
Key steps include:
1. Data Simulation: Creating a realistic dataset with engine performance metrics.
2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
Tools Used
1. Python: Data processing, analysis, and modeling
2. Pandas & NumPy: Data manipulation
3. Scikit-Learn & XGBoost: Machine learning model training
4. Matplotlib & Seaborn: Data visualization
Skills Demonstrated
✔ Data Simulation & Preprocessing
✔ Exploratory Data Analysis (EDA)
✔ Feature Engineering & Encoding
✔ Supervised Machine Learning (Classification)
✔ Model Evaluation & Hyperparameter Tuning
Key Insights & Findings
📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
📌 Maintenance Status Distribution: The balanced dataset ensures unbiased model training.
📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.
Challenges Faced
🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
🚧 Model Performance: Accuracy was limited (~35%) due to the complexity of failure prediction.
🚧 Feature Selection: Identifying the most impactful features required extensive analysis.
Call to Action
🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.
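The simulation-then-tuning workflow described above can be sketched roughly as follows. The feature names, label rule, and parameter grid are illustrative assumptions, not the project's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n = 600

# Hypothetical simulated sensor readings; the project's actual feature
# names and distributions may differ.
X = pd.DataFrame({
    "engine_temp": rng.normal(75, 10, n),
    "vibration_level": rng.normal(3.0, 1.0, n),
    "oil_pressure": rng.normal(4.5, 0.8, n),
    "rpm": rng.normal(1800, 200, n),
})
# Toy labeling rule so the data is learnable: hot, vibrating engines degrade.
risk = 0.6 * (X["engine_temp"] - 75) / 10 + 0.4 * (X["vibration_level"] - 3.0)
y = pd.cut(risk, bins=[-np.inf, -0.3, 0.5, np.inf],
           labels=["Normal", "Requires Maintenance", "Critical"]).astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Grid search over a small Random Forest hyperparameter grid.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100],
                                "max_depth": [3, 5, None]},
                    cv=3)
grid.fit(X_tr, y_tr)
print("best params:", grid.best_params_)
print("test accuracy:", round(grid.score(X_te, y_te), 3))
```

The same skeleton accommodates Decision Tree or XGBoost estimators by swapping the model passed to GridSearchCV.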
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
"Telecom Customer Churn Analysis and Prediction Dataset"
This dataset contains information on customers from a telecommunications company, designed to help identify the key factors that influence customer churn. Churn in the telecom industry refers to customers discontinuing their service, which has significant financial implications for service providers. Understanding why customers leave can help companies improve customer retention strategies, reduce churn rates, and enhance overall customer satisfaction.
Context & Source
The dataset provides real-world insights into telecom customer behavior, covering demographic, account, and usage information. This includes attributes like customer demographics, contract type, payment method, tenure, usage patterns, and whether the customer churned. Each record represents an individual customer, with labeled data indicating whether the customer is active or has churned.
This data is inspired by real-world telecom challenges and was created to support machine learning tasks such as classification, clustering, and exploratory data analysis (EDA). It’s particularly valuable for data scientists interested in predictive modeling for churn, as well as for business analysts working on customer retention strategies.
Potential Uses and Inspiration
This dataset can be used for:
- Building predictive models to classify customers as churned or active
- Analyzing which factors contribute most to churn
- Designing interventions for at-risk customers
- Practicing data preprocessing, feature engineering, and visualization skills

Whether you’re a beginner in machine learning or an experienced data scientist, this dataset offers opportunities to explore the complexities of customer behavior in the telecom industry and to develop strategies that can help reduce customer churn.
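A common first cut when hunting for churn drivers is the churn rate per contract type and the tenure gap between churners and non-churners. The mini-sample and column names below are illustrative assumptions; the actual dataset's columns may differ.

```python
import pandas as pd

# Hypothetical mini-sample; the real dataset's column names may differ.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "One year",
                 "Month-to-month", "Two year", "One year", "Month-to-month"],
    "tenure":   [2, 5, 60, 24, 1, 45, 30, 3],
    "Churn":    ["Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes"],
})

# Churn rate by contract type.
churn_rate = (df["Churn"] == "Yes").groupby(df["Contract"]).mean()
print(churn_rate)

# Do churners have shorter tenure on average?
print(df.groupby("Churn")["tenure"].mean())
```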
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains detailed records of customer interactions handled by a customer service team through various communication channels such as inbound calls, outbound calls, and digital touchpoints. It includes over 85,000 entries with information related to the nature of the issue, product categories, agent details, and customer satisfaction scores (CSAT).
Key features include:
Issue Metadata: Timestamps for when the issue was reported and responded to.
Categorization: High-level and sub-level issue categories for better analysis.
Agent Information: Names, supervisors, managers, shift, and tenure bucket.
Customer Feedback: CSAT scores and free-text customer remarks.
Transactional Data: Order IDs, product categories, item prices, and customer city.
This dataset is ideal for exploratory data analysis (EDA), natural language processing (NLP), time-to-resolution analysis, customer satisfaction prediction, and performance benchmarking of service agents.
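For the time-to-resolution angle, the sketch below derives a response time from the issue timestamps and checks its relationship with CSAT. The sample records and column names are illustrative assumptions, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical interaction records; actual column names in the dataset may differ.
df = pd.DataFrame({
    "issue_reported_at":  ["2024-05-01 09:00", "2024-05-01 10:15", "2024-05-02 14:30"],
    "issue_responded_at": ["2024-05-01 09:45", "2024-05-01 12:15", "2024-05-02 14:40"],
    "channel":            ["Inbound", "Outbound", "Digital"],
    "csat_score":         [4, 2, 5],
})

for col in ("issue_reported_at", "issue_responded_at"):
    df[col] = pd.to_datetime(df[col])

# Time-to-response in minutes, then its relationship with CSAT.
df["response_minutes"] = (df["issue_responded_at"]
                          - df["issue_reported_at"]).dt.total_seconds() / 60
print(df[["channel", "response_minutes", "csat_score"]])
print("correlation:", df["response_minutes"].corr(df["csat_score"]))
```

In this toy sample, slower responses line up with lower CSAT, which is the kind of pattern the full dataset lets you test at scale.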
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Data Set Description This dataset simulates a retail environment with a million rows and 100+ columns, covering customer information, transactional data, product details, promotional information, and customer behavior metrics. It includes data for predicting total sales (regression) and customer churn (classification).
Detailed Column Descriptions

Customer Information:
- customer_id: Unique identifier for each customer.
- age: Age of the customer.
- gender: Gender of the customer (e.g., Male, Female, Other).
- income_bracket: Income bracket of the customer (e.g., Low, Medium, High).
- loyalty_program: Whether the customer is part of a loyalty program (Yes/No).
- membership_years: Number of years the customer has been a member.
- churned: Whether the customer has churned (Yes/No) - target for classification.
- marital_status: Marital status of the customer.
- number_of_children: Number of children the customer has.
- education_level: Education level of the customer (e.g., High School, Bachelor's, Master's).
- occupation: Occupation of the customer.

Transactional Data:
- transaction_id: Unique identifier for each transaction.
- transaction_date: Date of the transaction.
- product_id: Unique identifier for each product.
- product_category: Category of the product (e.g., Electronics, Clothing, Groceries).
- quantity: Quantity of the product purchased.
- unit_price: Price per unit of the product.
- discount_applied: Discount applied on the transaction.
- payment_method: Payment method used (e.g., Credit Card, Debit Card, Cash).
- store_location: Location of the store where the purchase was made.

Customer Behavior Metrics:
- avg_purchase_value: Average value of purchases made by the customer.
- purchase_frequency: Frequency of purchases (e.g., Daily, Weekly, Monthly, Yearly).
- last_purchase_date: Date of the last purchase made by the customer.
- avg_discount_used: Average discount percentage used by the customer.
- preferred_store: Store location most frequently visited by the customer.
- online_purchases: Number of online purchases made by the customer.
- in_store_purchases: Number of in-store purchases made by the customer.
- avg_items_per_transaction: Average number of items per transaction.
- avg_transaction_value: Average value per transaction.
- total_returned_items: Total number of items returned by the customer.
- total_returned_value: Total value of returned items.

Sales Data:
- total_sales: Total sales amount for each customer over the last year - target for regression.
- total_transactions: Total number of transactions made by each customer.
- total_items_purchased: Total number of items purchased by each customer.
- total_discounts_received: Total discounts received by each customer.
- avg_spent_per_category: Average amount spent per product category.
- max_single_purchase_value: Maximum value of a single purchase.
- min_single_purchase_value: Minimum value of a single purchase.

Product Information:
- product_name: Name of the product.
- product_brand: Brand of the product.
- product_rating: Customer rating of the product.
- product_review_count: Number of reviews for the product.
- product_stock: Stock availability of the product.
- product_return_rate: Rate at which the product is returned.
- product_size: Size of the product (if applicable).
- product_weight: Weight of the product (if applicable).
- product_color: Color of the product (if applicable).
- product_material: Material of the product (if applicable).
- product_manufacture_date: Manufacture date of the product.
- product_expiry_date: Expiry date of the product (if applicable).
- product_shelf_life: Shelf life of the product (if applicable).

Promotional Data:
- promotion_id: Unique identifier for each promotion.
- promotion_type: Type of promotion (e.g., Buy One Get One Free, 20% Off).
- promotion_start_date: Start date of the promotion.
- promotion_end_date: End date of the promotion.
- promotion_effectiveness: Effectiveness of the promotion (e.g., High, Medium, Low).
- promotion_channel: Channel through which the promotion was advertised (e.g., Online, In-store, Social Media).
- promotion_target_audience: Target audience for the promotion (e.g., New Customers, Returning Customers).

Geographical Data:
- customer_zip_code: Zip code of the customer's residence.
- customer_city: City of the customer's residence.
- customer_state: State of the customer's residence.
- store_zip_code: Zip code of the store.
- store_city: City where the store is located.
- store_state: State where the store is located.
- distance_to_store: Distance from the customer's residence to the store.

Seasonal and Temporal Data:
- holiday_season: Whether the transaction occurred during a holiday season (Yes/No).
- season: Season of the year (e.g., Winter, Spring, Summer, Fall).
- weekend: Whether the transaction occurred on a weekend (Yes/No).

Customer Interaction Data:
- customer_support_calls: Number of calls made to customer support.
- email_subscription...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures for describing data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support the arguments made in the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. These hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database. This database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was guided by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, found in the repository. We manually engineered these features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing consisted of casting the features to the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to fill in information missed during the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models.
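The transformation step above can be sketched as follows. The analysis itself was conducted in RapidMiner; this is a hedged Python approximation on a random stand-in matrix (128 papers x 35 one-hot features), using KMeans inertia as the variance-reduction criterion for choosing the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Stand-in for the one-hot encoded paper/feature matrix (papers x 35 nominal features).
X = rng.integers(0, 2, size=(128, 35)).astype(float)

# Reduce the 35 features to 2 components for visualization.
coords = PCA(n_components=2, random_state=7).fit_transform(X)
print(coords.shape)  # (128, 2)

# Scan candidate cluster counts; the "elbow" in within-cluster variance
# (inertia) suggests a value of k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=7).fit(X).inertia_
            for k in range(2, 8)}
print(inertias)
```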
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be to uncover hidden relationships among the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state of the art (Clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produced an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles represent both premises and conclusions. An arrow connecting a premise with a conclusion indicates that, given the premise, the conclusion holds with a certain support and confidence. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain support and confidence.
Support = the fraction of papers in which the rule holds, i.e., the number of papers containing both the premise and the conclusion divided by the total number of papers.
Confidence = the fraction of papers containing the premise in which the conclusion also holds, i.e., the number of papers containing both divided by the number of papers containing the premise.
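These two definitions can be made concrete with a few lines of Python. The five paper records below are hypothetical, chosen only to echo the Supervised Learning / irreproducibility example above.

```python
# Each record lists properties extracted from one (hypothetical) DL4SE paper.
papers = [
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Reproducible"},
    {"Unsupervised Learning", "Irreproducible"},
    {"Supervised Learning", "Irreproducible"},
]

def support(premise, conclusion, records):
    """Fraction of records in which both the premise and the conclusion hold."""
    both = sum(1 for r in records if premise in r and conclusion in r)
    return both / len(records)

def confidence(premise, conclusion, records):
    """Among records containing the premise, the fraction that also contain the conclusion."""
    prem = sum(1 for r in records if premise in r)
    both = sum(1 for r in records if premise in r and conclusion in r)
    return both / prem

print(support("Supervised Learning", "Irreproducible", papers))     # 3/5 = 0.6
print(confidence("Supervised Learning", "Irreproducible", papers))  # 3/4 = 0.75
```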