License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
The different normalization methods applied in this study, and whether or not they account for lexical variation, synonymy, orthology and species-specific resolution. By creating combinations of these algorithms, their individual strengths can be aggregated.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Data analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose data analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures for describing data. We ran an EDA to provide statistical facts and inform conclusions; the mined facts supply arguments that shape the Systematic Literature Review (SLR) of DL4SE.
The SLR of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases process, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features (or attributes) that can be found in the repository. In fact, we manually engineered these features from the DL4SE papers; examples include venue, year published, type of paper, metrics, data scale, type of tuning, learning algorithm, and SE data.
2. Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract information missed during the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes supports a better understanding and classification of the papers by the data mining tasks.
3. Transformation. In this stage, we did not apply any data transformation except for the clustering analysis, where we performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models (a rough illustration of this step follows the stage list below).
4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (correlations and association rules) and toward categorizing the DL4SE papers for a better segmentation of the state of the art (clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
5. Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
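As a rough illustration of the transformation step (stage 3), the following Python sketch, using scikit-learn with a placeholder feature matrix standing in for the 35 extracted attributes, projects the data to two components and scans cluster counts by within-cluster variance. It is not the actual pipeline, which ran in RapidMiner.

```python
# Minimal sketch of the transformation step: PCA to 2 components plus a
# simple elbow-style scan of cluster counts. `X` is a hypothetical
# (papers x 35 features) numeric matrix, not the real DL4SE data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(125, 35))          # placeholder for the encoded DL4SE features

X_std = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_std)   # 2D projection for plotting

# Scan candidate cluster counts and record the within-cluster variance (inertia).
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d).inertia_
            for k in range(2, 9)}
for k, inertia in inertias.items():
    print(f"k={k}: within-cluster variance={inertia:.1f}")
```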
Overview of the most meaningful association rules. Rectangles represent both premises and conclusions. An arrow connecting a premise to a conclusion implies that, given the premise, the conclusion is associated with it. For example, given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain support and confidence.
Support = (number of occurrences in which the statement is true) / (total number of statements)
Confidence = (support of the statement) / (number of occurrences of the premise)
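For concreteness, here is a small Python sketch of these two quantities computed over per-paper attribute sets; the papers and attributes below are made up for illustration, not taken from the SLR data.

```python
# Hedged illustration of support and confidence for a single association rule,
# computed over hypothetical per-paper attribute sets.
def support_confidence(records, premise, conclusion):
    n = len(records)
    premise_count = sum(1 for r in records if premise <= r)
    rule_count = sum(1 for r in records if (premise | conclusion) <= r)
    support = rule_count / n
    confidence = rule_count / premise_count if premise_count else 0.0
    return support, confidence

papers = [
    {"supervised_learning", "irreproducible"},
    {"supervised_learning", "reproducible"},
    {"unsupervised_learning", "irreproducible"},
    {"supervised_learning", "irreproducible"},
]
s, c = support_confidence(papers, {"supervised_learning"}, {"irreproducible"})
print(f"support={s:.2f}, confidence={c:.2f}")  # support=0.50, confidence=0.67
```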
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Dataset used to generate figure 4: QCM/QCC plots using different normalizations for the SCA input counts table. A) Log10 transformed (figure4/setA/Results/setAMIRNA_SIMLR/5/setA_StabilitySignificativityJittered.pdf), B) Centred log-ratio normalization (CLR) (figure4/setA/Results/CLR_FNMIRNA_SIMLR/5/normalized_CLR_FN_StabilitySignificativityJittered.pdf), C) relative log-expression (RLE) (figure4/setA/Results/DESEQ_FNMIRNA_SIMLR/5/normalized_DESEQ_FN_StabilitySignificativityJittered.pdf), D) full-quantile normalization (FQ) (figure4/setA/Results/FQ_FNMIRNA_SIMLR/5/normalized_FQ_FN_StabilitySignificativityJittered.pdf), E) sum scaling normalization (SUM) (/figure4/setA/Results/SUM_FNMIRNA_SIMLR/5/normalized_SUM_FN_StabilitySignificativityJittered.pdf), F) weighted trimmed mean of M-values (TMM) (figure4/setA/Results/TMM_FNMIRNA_SIMLR/5/normalized_TMM_FN_StabilitySignificativityJittered.pdf).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This dataset, named URL-Phish, is designed for phishing detection research. It contains 111,660 unique URLs divided into:
• 100,000 benign samples (label = 0), collected from trusted sources including educational (.edu), governmental (.gov), and top-ranked domains. The benign dataset was obtained from the Research Organization Registry [1].
• 11,660 phishing samples (label = 1), obtained from the PhishTank repository [2] between November 2024 and September 2025.
Each URL entry was automatically processed to extract 22 lexical and structural features, such as URL length, domain length, number of subdomains, digit ratio, entropy, and HTTPS usage. In addition, three reference columns (url, dom, tld) are preserved for interpretability, and one label column is included (0 = benign, 1 = phishing). A data cleaning step removed duplicates and empty entries, followed by feature normalization to ensure consistency. The dataset is provided in CSV format with 22 numerical feature columns, 3 string reference columns, and 1 label column (26 columns in total).
References
[1] Research Organization Registry, “ROR Data,” Zenodo, Sept. 22, 2025. doi: 10.5281/ZENODO.6347574.
[2] PhishTank, “PhishTank: Join the fight against phishing.” [Online]. Available: https://phishtank.org
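To illustrate the kind of lexical features listed above, here is a small Python sketch that computes a handful of them for a single URL. The exact 22 feature definitions used to build URL-Phish are not specified here, so the helper below is only an approximation.

```python
# Approximate versions of a few lexical URL features mentioned in the description
# (URL length, domain length, subdomain count, digit ratio, entropy, HTTPS usage).
# These are illustrative; the actual URL-Phish feature definitions may differ.
import math
from collections import Counter
from urllib.parse import urlparse

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values()) if total else 0.0

def url_features(url: str) -> dict:
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "num_subdomains": max(host.count(".") - 1, 0),   # crude: ignores public suffixes
        "digit_ratio": sum(ch.isdigit() for ch in url) / len(url) if url else 0.0,
        "entropy": shannon_entropy(url),
        "uses_https": int(parsed.scheme == "https"),
    }

print(url_features("https://login.example-secure123.com/verify?id=42"))
```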
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This data accompanies the following publication:
Title: Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H) 2017 shared task
Journal: Journal of the American Medical Informatics Association (JAMIA)
The evaluation data (in addition to the training data) was used for the SMM4H-2017 shared tasks, co-located with AMIA-2017 (Washington DC).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This file provides additional details on the pathway curation use-case, which describes a subsection of the human p53 signaling pathway. In this supplemental file, the data on the full p53 pathway are also provided. (XLS)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This map is an in-season crop type map for Iowa commodity crops, predicted based on Harmonized Landsat and Sentinel-2 (HLS) observations through August 2020 using the machine learning method described in Kerner et al., 2020 [1]. Each 30m pixel gives a label of corn (0), soybean (1), or other (2). The "other" class includes crops that are not corn or soybean as well as other land cover types (i.e., "other" means the pixel is not corn or soybean). [1] Kerner, H. R., Sahajpal, R., Skakun, S., Becker-Reshef, I., Barker, B., Hosseini, M. (2020). Resilient In-Season Crop Type Classification in Multispectral Satellite Observations using Growth Stage Normalization. ACM SIGKDD Conference on Knowledge Discovery and Data Mining Workshops, https://arxiv.org/abs/2009.10189.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the classic Wine Recognition Dataset from the UCI Machine Learning Repository — now presented in two formats:
• wine_clean.csv: for fast ML workflows
• wine_original.zip: for purists and explorers
Perfect for learning K-Nearest Neighbors (KNN), exploring distance metrics like Euclidean, Manhattan, and Cosine, and building visual + interactive ML notebooks (see the quick-start sketch at the end of this description).
| File | Description |
|---|---|
| wine_clean.csv | Clean version with column names, no missing data, and ready to use |
| wine.zip | Raw UCI files: wine.data, wine.names, etc., for reference or manual parsing |
| Feature | Description |
|---|---|
| Class | Target: cultivar of wine (1, 2, or 3) |
| Alcohol | Alcohol content |
| Malic_Acid | Malic acid amount |
| Ash | Ash content |
| Alcalinity_of_Ash | Alkalinity of ash |
| Magnesium | Magnesium content |
| Total_Phenols | Total phenol compounds |
| Flavanoids | Flavonoid concentration |
| Nonflavanoid_Phenols | Non-flavonoid phenols |
| Proanthocyanins | Amount of proanthocyanins |
| Color_Intensity | Intensity of wine color |
| Hue | Hue of wine |
| OD280_OD315 | Optical density ratio |
| Proline | Proline levels |
Public Domain (CC0) — free to use, remix, and share 🌍
If you're an ML student or early-career data scientist, this dataset is your 🍷 playpen. Dive in!
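A minimal KNN quick-start, assuming the wine_clean.csv layout described above (a Class target plus the 13 numeric features); nothing beyond those column names is assumed:

```python
# Quick-start KNN on wine_clean.csv, comparing the distance metrics mentioned above.
# Assumes the column layout described in the tables (Class target + 13 numeric features).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("wine_clean.csv")
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

for metric in ["euclidean", "manhattan", "cosine"]:
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5, metric=metric))
    knn.fit(X_train, y_train)
    print(f"{metric:>10}: accuracy = {knn.score(X_test, y_test):.3f}")
```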
Motivation
Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is essential for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet the dataset is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.
Approach
For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.
After completing the extraction, we estimated mean spectral profiles for each acquisition date, once for the samples “inside” the mining area and once for those “outside” it. In this process, we masked pixels affected by clouds and cloud shadows using Landsat's quality information.
Using the time series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed at emphasizing differences in the shape of the spectral profiles rather than in their absolute values, which can be related to radiometric inaccuracies or simply to differences in acquisition dates. This resulted in an RMSE time series for each mining site.
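The exact normalization is not spelled out above; as a hedged sketch, the snippet below min-max scales each profile before taking the RMSE, which captures the shape-over-magnitude intent:

```python
# Hedged sketch of the per-date comparison: scale each mean spectral profile,
# then compute the RMSE between the "inside" and "outside" profiles.
# Min-max scaling is an assumption; the text only states that profiles were normalized.
import numpy as np

def scaled_rmse(inside: np.ndarray, outside: np.ndarray) -> float:
    def minmax(p):
        span = p.max() - p.min()
        return (p - p.min()) / span if span else np.zeros_like(p)
    a, b = minmax(inside), minmax(outside)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Example with two hypothetical 6-band mean profiles for one acquisition date.
inside = np.array([0.05, 0.08, 0.07, 0.30, 0.22, 0.15])
outside = np.array([0.04, 0.07, 0.06, 0.35, 0.30, 0.18])
print(scaled_rmse(inside, outside))
```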
We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested; in this case, the accumulated values would tilt upwards. However, if a mining operation was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.
To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC): an elbow is characterized by a convex time-series shape, which makes the AUC greater than 50%, whereas a concave curve makes the knee the more adequate metric. We limited the detection of shifts to time series with at least 100 time steps; below this threshold, we assumed the mine (or the conditions to sustain it) was present since 1990.
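A rough Python sketch of this detection step, assuming kneebow's Rotor interface and an AUC rule like the one described above; the thresholds and data handling here are illustrative, not the exact study code:

```python
# Illustrative shift detection on a cumulative RMSE series using kneebow.
# The AUC rule (convex -> elbow, concave -> knee) follows the description above;
# exact thresholds and data handling in the study may differ.
import numpy as np
from kneebow.rotor import Rotor

def first_mining_index(rmse: np.ndarray) -> int:
    cum = np.cumsum(rmse)                       # smooth noise, keep directional shifts
    xy = np.column_stack([np.arange(len(cum)), cum])

    # Normalized AUC of the cumulative curve: > 0.5 suggests a convex shape (elbow).
    auc = np.trapz(cum / cum.max()) / (len(cum) - 1)

    rotor = Rotor()
    rotor.fit_rotate(xy)
    return rotor.get_elbow_index() if auc > 0.5 else rotor.get_knee_index()

rmse_series = np.concatenate([np.full(60, 0.05), np.full(60, 0.25)])  # synthetic upward shift
print(first_mining_index(rmse_series))
```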
Content
This repository contains the infrastructure used to infer the start of a mining operation, organized as follows:
00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.
01_analysis - Contains several outputs of our analysis:
xy.tar.gz - Sample locations for each mining site.
sr.tar.gz - Spectral profiles for each sample location.
mine_start.csv - First year when we detected the start of mining.
02_code - Includes all code used in our analysis.
requirements.txt - Python module requirements that can be fed to pip to replicate our study.
config.yml - Configuration file, including information on the Landsat products used.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstrations of key preprocessing techniques such as missing-value imputation, duplicate removal, outlier detection and treatment, and feature scaling, reflecting the issues introduced above. The columns are:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0–100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Prediction task: Test_Score → predict the student's test score from study hours, attendance percentage, age, and gender.
🧠 Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'] y = ['Test_Score']
You can use standard regression models for this task and analyze feature influence using correlation or SHAP/LIME explainability.
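A hedged end-to-end sketch of the exercise, assuming a CSV export with the columns listed in the table; the file name student_performance.csv and some preprocessing choices are assumptions:

```python
# Illustrative preprocessing + regression pipeline for the synthetic student data.
# File name and preprocessing choices are assumptions; adapt to the actual export.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("student_performance.csv").drop_duplicates(subset="Student_ID")

# Clip obvious Study_Hours outliers (e.g., negative or > 24 h/day) before modelling.
df["Study_Hours"] = df["Study_Hours"].clip(lower=0, upper=24)

X = df[["Age", "Gender", "Study_Hours", "Attendance(%)"]]
y = df["Test_Score"]

prep = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age", "Study_Hours", "Attendance(%)"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])
model = Pipeline([("prep", prep), ("reg", RandomForestRegressor(random_state=0))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))
```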
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Performance of pGenN & GenNorm on use case data set.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This dataset consists of code-mixed multilingual text data designed for sentiment analysis research. It captures naturally occurring code-mixed patterns combining English with ten Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset aims to support studies in multilingual NLP, sentiment classification, and language processing for real-world social media and conversational data.
Dataset description. The dataset contains the following attributes:
• Text: the original code-mixed text sample.
• Sentiment: the corresponding sentiment label (positive, negative, or neutral).
• Translated_text: English translation of the original text.
• Cleaned_text: text after preprocessing, including lowercasing, punctuation and stopword removal, and normalization.
• Tokens: tokenized representation of the cleaned text.
Preprocessing involved cleaning (removal of punctuation, URLs, and emojis), normalization of repeated characters, language-specific stopword removal, translation to English, and token formation for downstream NLP tasks.
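A rough sketch of cleaning steps of this kind; the regex patterns and repeated-character rule below are illustrative, not the dataset's exact pipeline, and stopword removal and translation are omitted because they are language-specific:

```python
# Illustrative text cleaning similar to the steps described above:
# strip URLs, punctuation, and emojis, collapse repeated characters, then tokenize.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_RE = re.compile(r"[^\w\s]", flags=re.UNICODE)   # drops punctuation and most emojis
REPEAT_RE = re.compile(r"(.)\1{2,}")                      # "soooo" -> "soo"

def clean_text(text: str) -> list[str]:
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return text.split()

print(clean_text("Movie was soooo mast yaar!! 🔥 https://t.co/xyz"))
# ['movie', 'was', 'soo', 'mast', 'yaar']
```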
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Note: Data (V1, V2) are mean brain volumes after normalization to the supratentorial volume with a scale factor of 1,000 (no unit). **Calculated with a two-sample t test to obtain the original p-value (shown with P
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This dataset provides a comprehensive view of student performance and learning behavior, integrating academic, demographic, behavioral, and psychological factors.
It was created by merging two publicly available Kaggle datasets, resulting in a unified dataset of 14,003 student records with 16 attributes. All entries are anonymized, with no personally identifiable information.
Attributes include StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions, Resources, Internet, EduTech, Motivation, StressLevel, Gender, Age (18–30 years), LearningStyle, ExamScore, and FinalGrade.
The dataset can be used for predicting performance outcomes (ExamScore, FinalGrade), among other tasks. It was analyzed in Python, including clustering of LearningStyle categories and extracting insights for adaptive learning.
merged_dataset.csv → 14,003 rows × 16 columns, covering student demographics, behaviors, engagement, learning styles, and performance indicators.
This dataset is an excellent playground for educational data mining — from clustering and behavioral analytics to predictive modeling and personalized learning applications.
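As a hedged starting point for the clustering use mentioned above; the feature subset, the assumption that these columns are numeric, and k = 4 are illustrative choices, not part of the dataset description:

```python
# Illustrative clustering of behavioral/engagement attributes from merged_dataset.csv.
# Assumes the selected columns are numeric; the feature subset and k=4 are
# demonstration choices only.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("merged_dataset.csv")
features = ["StudyHours", "Attendance", "AssignmentCompletion", "Motivation", "StressLevel"]

X = StandardScaler().fit_transform(df[features].dropna())
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(pd.Series(clusters).value_counts())
```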
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Predicting student performance is crucial for providing personalized support and enhancing academic outcomes. Advanced machine-learning approaches are being used to understand student performance variables as educational data grows. A large dataset from several Chinese institutions and high schools is used to develop a credible student performance prediction technique. The dataset includes 80 features and 200,000 records and consequently represents one of the most extensive data collections available for educational research. Initially, the data is passed through preprocessing to address outliers and missing values. In addition, we developed a novel hybrid feature selection model that combines correlation filtering, mutual information, Recursive Feature Elimination (RFE) with Cross-Validation (CV), and stability selection to identify the most impactful features. This study then develops the proposed EffiXNet, a refined version of EfficientNet augmented with self-attention mechanisms, dynamic convolutions, improved normalization methods, and the Sparrow Search Optimization Algorithm for hyperparameter optimization. The developed model was tested using an 80/20 train-test split, where 160,000 records were used for training and 40,000 for testing. The results reported, including accuracy, precision, recall, and F1-score, are based on the full test dataset; for better visualization, the confusion matrices display only a representative subset of test results. Furthermore, EffiXNet achieved an AUC of 0.99, a 25% reduction in logarithmic loss relative to the baseline models, a precision of 97.8%, an F1-score of 98.1%, and reliable optimization of memory usage. Significantly, the developed model showed a consistently high performance level across various metrics, indicating that it is proficient in capturing intricate data patterns. The key insights the current research provides are the necessity of early intervention and directed training support in the educational domain. The EffiXNet framework offers a robust, scalable, and efficient solution for predicting student performance, with potential applications in academic institutions worldwide.
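The hybrid feature-selection idea (correlation filtering, mutual information, cross-validated RFE, stability selection) can be sketched with scikit-learn along the following lines; this is a generic illustration on synthetic data, not the authors' implementation:

```python
# Generic sketch of a hybrid feature-selection stage like the one described:
# 1) drop one of each highly correlated pair, 2) rank by mutual information,
# 3) refine with cross-validated RFE. Stability selection (repeating the
# procedure on resamples and keeping consistently selected features) is omitted.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=80, n_informative=15, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(80)])

# 1) Correlation filtering: drop the second feature of any pair with |r| > 0.9.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

# 2) Mutual information: keep the 30 most informative remaining features.
mi = mutual_info_classif(X, y, random_state=0)
X = X[X.columns[np.argsort(mi)[::-1][:30]]]

# 3) Cross-validated recursive feature elimination.
rfecv = RFECV(LogisticRegression(max_iter=2000), step=1, cv=5).fit(X, y)
print("selected features:", list(X.columns[rfecv.support_]))
```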
License: ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/ (license information derived automatically)
This dataset encompasses comprehensive air quality measurements collected over several months, focusing on various pollutants. It is intended for use in predictive modeling and data analysis within the fields of environmental science and public health. The data offers valuable insights into the concentration levels of different gases, making it suitable for both regression and classification tasks in machine learning applications.
| Feature | Description |
|---|---|
| Date | The date of the measurement. |
| Time | The time of the measurement. |
| CO(GT) | Concentration of carbon monoxide (CO) in the air (µg/m³). |
| PT08.S1(CO) | Sensor measurement for CO concentration. |
| NMHC(GT) | Concentration of non-methane hydrocarbons (NMHC) (µg/m³). |
| C6H6(GT) | Concentration of benzene (C6H6) in the air (µg/m³). |
| PT08.S2(NMHC) | Sensor measurement for NMHC concentration. |
| NOx(GT) | Concentration of nitrogen oxides (NOx) in the air (µg/m³). |
| PT08.S3(NOx) | Sensor measurement for NOx concentration. |
| NO2(GT) | Concentration of nitrogen dioxide (NO2) in the air (µg/m³). |
The dataset includes frequency distributions for each feature, categorized into specified ranges, along with key summary statistics.
This dataset is publicly available for research purposes. If you use this dataset, please cite it as follows:
[Insert citation details based on the original source of the dataset].
Created by: [Include authors or organizations responsible for the dataset].
The dataset has been utilized in numerous studies focusing on air quality analysis and its implications for public health. It serves as a foundational resource for applying various data mining techniques to explore pollutant concentrations and their correlations with health outcomes.
The dataset features temporal measurements related to air quality, enabling the assessment of pollution trends over time. It can be leveraged for both classification and regression tasks, with a focus on data normalization and strategies for handling missing values.
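A hedged sketch of the kind of workflow mentioned above; the file name, the missing-value sentinel, the feature subset, and the target choice are assumptions (the classic UCI Air Quality data, for example, marks missing readings with -200, but this copy may differ):

```python
# Illustrative regression workflow: handle missing values, normalize features,
# and predict benzene concentration from the remaining channels.
# File name, the -200 sentinel, feature subset, and target column are assumptions.
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("air_quality.csv")
df = df.replace(-200, pd.NA)                      # common sentinel for missing readings

features = ["PT08.S1(CO)", "PT08.S2(NMHC)", "PT08.S3(NOx)", "NOx(GT)", "NO2(GT)"]
target = "C6H6(GT)"

data = df[features + [target]].apply(pd.to_numeric, errors="coerce")
data = data.dropna(subset=[target])               # keep rows where the target is present
X = data[features].fillna(data[features].median())
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), Ridge()).fit(X_train, y_train)
print("R^2:", round(model.score(X_test, y_test), 3))
```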
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
The study commenced with a questionnaire survey, which yielded a total of 1,742 initial demands from the user group. Subsequently, 231 invalid sample data were eliminated, and 1,511 valid data were obtained. To further enhance the quality of the sample, 120 users were selected for field household research and in-depth interviews using a random sampling method. One-on-one structured interviews were conducted with the 120 participants between 10 May 2022 and 25 June 2024, and the content of the interviews was adapted to align with the participants' preferences. Each interview lasted approximately 180 minutes and took place approximately two weeks after the field study. The study yielded a final sample of 120 kitchen space needs explorations, with each participant's statements being coded. In order to safeguard the anonymity of the interviewees and to adhere to data protection regulations, only the 1,000 coded data points, excluding the interviews, are presented in this report. These data are displayed in the form of a statistical table entitled 'Coded Data from the Research Sample'.

In the data processing and analysis stages, the Epistemic Network Analysis (ENA) Web Tool (version 1.7.0) was employed for the processing and analysis of the coded data. The length of the sliding window was set to six lines, comprising the current line and the preceding five lines; that is, the co-occurrence of requisite elements is calculated over each six adjacent interview data lines. An adjacency matrix is constructed, and the resulting adjacency vectors are subsequently accumulated. To accommodate the potential discrepancy in the number of data coding rows across different analysis units, all network data are normalized prior to dimension reduction. The singular value decomposition (SVD) method is employed to generate orthogonal dimensions, thereby maximising the variance explained by each dimension. The final map of the kitchen space demand network model can be seen in Figures 3 to 6 of the paper.
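A compact sketch of the normalization-then-SVD step on accumulated co-occurrence vectors, using synthetic data; the ENA Web Tool's exact procedure (e.g., its specific normalization) is not reproduced here:

```python
# Illustrative dimension reduction on accumulated co-occurrence (adjacency) vectors:
# normalize each unit's vector to unit length, centre, then project with SVD.
# This mimics the described normalization + SVD step, not the ENA tool itself.
import numpy as np

rng = np.random.default_rng(0)
units = rng.poisson(lam=3, size=(120, 21)).astype(float)   # 120 units x 21 code pairs (synthetic)

norms = np.linalg.norm(units, axis=1, keepdims=True)
normalized = np.divide(units, norms, out=np.zeros_like(units), where=norms > 0)

centered = normalized - normalized.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ Vt[:2].T                            # unit positions on the first two dimensions
explained = (S**2) / (S**2).sum()
print("variance explained by first two dimensions:", explained[:2].round(3))
```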