License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
The different normalization methods applied in this study, and whether or not they account for lexical variation, synonymy, orthology and species-specific resolution. By creating combinations of these algorithms, their individual strengths can be aggregated.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Data analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose data analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures for describing data. We ran an EDA to provide statistical facts and inform conclusions; the mined facts supply arguments that shape the Systematic Literature Review (SLR) of DL4SE.
The SLR of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases process, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features (or attributes) that can be found in the repository. In fact, we manually engineered these features from the DL4SE papers; examples include venue, year published, type of paper, metrics, data scale, type of tuning, learning algorithm, and SE data.
2. Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract information missed during the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes supports a better understanding and classification of the papers by the data mining tasks.
3. Transformation. In this stage, we did not apply any data transformation except for the clustering analysis, where we performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models (a rough illustration of this step follows the stage list below).
4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (correlations and association rules) and toward categorizing the DL4SE papers for a better segmentation of the state of the art (clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
5. Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
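As a rough illustration of the transformation step (stage 3), the following Python sketch, using scikit-learn with a placeholder feature matrix standing in for the 35 extracted attributes, projects the data to two components and scans cluster counts by within-cluster variance. It is not the actual pipeline, which ran in RapidMiner.

```python
# Minimal sketch of the transformation step: PCA to 2 components plus a
# simple elbow-style scan of cluster counts. `X` is a hypothetical
# (papers x 35 features) numeric matrix, not the real DL4SE data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(125, 35))          # placeholder for the encoded DL4SE features

X_std = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_std)   # 2D projection for plotting

# Scan candidate cluster counts and record the within-cluster variance (inertia).
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_2d).inertia_
            for k in range(2, 9)}
for k, inertia in inertias.items():
    print(f"k={k}: within-cluster variance={inertia:.1f}")
```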
Overview of the most meaningful association rules. Rectangles represent both premises and conclusions. An arrow connecting a premise to a conclusion implies that, given the premise, the conclusion is associated with it. For example, given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain support and confidence.
Support = (number of occurrences in which the statement is true) / (total number of statements)
Confidence = (support of the statement) / (number of occurrences of the premise)
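For concreteness, here is a small Python sketch of these two quantities computed over per-paper attribute sets; the papers and attributes below are made up for illustration, not taken from the SLR data.

```python
# Hedged illustration of support and confidence for a single association rule,
# computed over hypothetical per-paper attribute sets.
def support_confidence(records, premise, conclusion):
    n = len(records)
    premise_count = sum(1 for r in records if premise <= r)
    rule_count = sum(1 for r in records if (premise | conclusion) <= r)
    support = rule_count / n
    confidence = rule_count / premise_count if premise_count else 0.0
    return support, confidence

papers = [
    {"supervised_learning", "irreproducible"},
    {"supervised_learning", "reproducible"},
    {"unsupervised_learning", "irreproducible"},
    {"supervised_learning", "irreproducible"},
]
s, c = support_confidence(papers, {"supervised_learning"}, {"irreproducible"})
print(f"support={s:.2f}, confidence={c:.2f}")  # support=0.50, confidence=0.67
```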
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Dataset used to generate figure 4: QCM/QCC plots using different normalizations for the SCA input counts table. A) Log10 transformed (figure4/setA/Results/setAMIRNA_SIMLR/5/setA_StabilitySignificativityJittered.pdf), B) Centred log-ratio normalization (CLR) (figure4/setA/Results/CLR_FNMIRNA_SIMLR/5/normalized_CLR_FN_StabilitySignificativityJittered.pdf), C) relative log-expression (RLE) (figure4/setA/Results/DESEQ_FNMIRNA_SIMLR/5/normalized_DESEQ_FN_StabilitySignificativityJittered.pdf), D) full-quantile normalization (FQ) (figure4/setA/Results/FQ_FNMIRNA_SIMLR/5/normalized_FQ_FN_StabilitySignificativityJittered.pdf), E) sum scaling normalization (SUM) (/figure4/setA/Results/SUM_FNMIRNA_SIMLR/5/normalized_SUM_FN_StabilitySignificativityJittered.pdf), F) weighted trimmed mean of M-values (TMM) (figure4/setA/Results/TMM_FNMIRNA_SIMLR/5/normalized_TMM_FN_StabilitySignificativityJittered.pdf).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This dataset, named URL-Phish, is designed for phishing detection research. It contains 111,660 unique URLs divided into:
• 100,000 benign samples (label = 0), collected from trusted sources including educational (.edu), governmental (.gov), and top-ranked domains. The benign dataset was obtained from the Research Organization Registry [1].
• 11,660 phishing samples (label = 1), obtained from the PhishTank repository [2] between November 2024 and September 2025.
Each URL entry was automatically processed to extract 22 lexical and structural features, such as URL length, domain length, number of subdomains, digit ratio, entropy, and HTTPS usage. In addition, three reference columns (url, dom, tld) are preserved for interpretability, and one label column is included (0 = benign, 1 = phishing). A data cleaning step removed duplicates and empty entries, followed by feature normalization to ensure consistency. The dataset is provided in CSV format with 22 numerical feature columns, 3 string reference columns, and 1 label column (26 columns in total).
References
[1] Research Organization Registry, “ROR Data,” Zenodo, Sept. 22, 2025. doi: 10.5281/ZENODO.6347574.
[2] PhishTank, “PhishTank: Join the fight against phishing.” [Online]. Available: https://phishtank.org
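To illustrate the kind of lexical features listed above, here is a small Python sketch that computes a handful of them for a single URL. The exact 22 feature definitions used to build URL-Phish are not specified here, so the helper below is only an approximation.

```python
# Approximate versions of a few lexical URL features mentioned in the description
# (URL length, domain length, subdomain count, digit ratio, entropy, HTTPS usage).
# These are illustrative; the actual URL-Phish feature definitions may differ.
import math
from collections import Counter
from urllib.parse import urlparse

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values()) if total else 0.0

def url_features(url: str) -> dict:
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "num_subdomains": max(host.count(".") - 1, 0),   # crude: ignores public suffixes
        "digit_ratio": sum(ch.isdigit() for ch in url) / len(url) if url else 0.0,
        "entropy": shannon_entropy(url),
        "uses_https": int(parsed.scheme == "https"),
    }

print(url_features("https://login.example-secure123.com/verify?id=42"))
```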
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This data accompanies the following publication:
Title: Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H) 2017 shared task
Journal: Journal of the American Medical Informatics Association (JAMIA)
The evaluation data (in addition to the training data) was used for the SMM4H-2017 shared tasks, co-located with AMIA-2017 (Washington DC).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This file provides additional details on the pathway curation use-case, which describes a subsection of the human p53 signaling pathway. In this supplemental file, the data on the full p53 pathway are also provided. (XLS)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This map is an in-season crop type map for Iowa commodity crops, predicted based on Harmonized Landsat and Sentinel-2 (HLS) observations through August 2020 using the machine learning method described in Kerner et al., 2020 [1]. Each 30m pixel gives a label of corn (0), soybean (1), or other (2). The "other" class includes crops that are not corn or soybean as well as other land cover types (i.e., "other" means the pixel is not corn or soybean). [1] Kerner, H. R., Sahajpal, R., Skakun, S., Becker-Reshef, I., Barker, B., Hosseini, M. (2020). Resilient In-Season Crop Type Classification in Multispectral Satellite Observations using Growth Stage Normalization. ACM SIGKDD Conference on Knowledge Discovery and Data Mining Workshops, https://arxiv.org/abs/2009.10189.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains the classic Wine Recognition Dataset from the UCI Machine Learning Repository — now presented in two formats:
• wine_clean.csv: for fast ML workflows
• wine_original.zip: for purists and explorers
Perfect for learning K-Nearest Neighbors (KNN), exploring distance metrics like Euclidean, Manhattan, and Cosine, and building visual + interactive ML notebooks (see the quick-start sketch at the end of this description).
| File | Description |
|---|---|
| wine_clean.csv | Clean version with column names, no missing data, and ready to use |
| wine.zip | Raw UCI files: wine.data, wine.names, etc., for reference or manual parsing |
| Feature | Description |
|---|---|
| Class | Target: cultivar of wine (1, 2, or 3) |
| Alcohol | Alcohol content |
| Malic_Acid | Malic acid amount |
| Ash | Ash content |
| Alcalinity_of_Ash | Alkalinity of ash |
| Magnesium | Magnesium content |
| Total_Phenols | Total phenol compounds |
| Flavanoids | Flavonoid concentration |
| Nonflavanoid_Phenols | Non-flavonoid phenols |
| Proanthocyanins | Amount of proanthocyanins |
| Color_Intensity | Intensity of wine color |
| Hue | Hue of wine |
| OD280_OD315 | Optical density ratio |
| Proline | Proline levels |
Public Domain (CC0) — free to use, remix, and share 🌍
If you're an ML student or early-career data scientist, this dataset is your 🍷 playpen. Dive in!
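A minimal KNN quick-start, assuming the wine_clean.csv layout described above (a Class target plus the 13 numeric features); nothing beyond those column names is assumed:

```python
# Quick-start KNN on wine_clean.csv, comparing the distance metrics mentioned above.
# Assumes the column layout described in the tables (Class target + 13 numeric features).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("wine_clean.csv")
X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

for metric in ["euclidean", "manhattan", "cosine"]:
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5, metric=metric))
    knn.fit(X_train, y_train)
    print(f"{metric:>10}: accuracy = {knn.score(X_test, y_test):.3f}")
```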
Motivation
Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is essential for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet the dataset is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.
Approach
For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.
After completing the extraction, we estimated mean spectral profiles for each acquisition date, once for the samples “inside” the mining area and once for those “outside” it. In this process, we masked pixels affected by clouds and cloud shadows using Landsat's quality information.
Using the time series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed at emphasizing differences in the shape of the spectral profiles rather than in their absolute values, which can be related to radiometric inaccuracies or simply to differences in acquisition dates. This resulted in an RMSE time series for each mining site.
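The exact normalization is not spelled out above; as a hedged sketch, the snippet below min-max scales each profile before taking the RMSE, which captures the shape-over-magnitude intent:

```python
# Hedged sketch of the per-date comparison: scale each mean spectral profile,
# then compute the RMSE between the "inside" and "outside" profiles.
# Min-max scaling is an assumption; the text only states that profiles were normalized.
import numpy as np

def scaled_rmse(inside: np.ndarray, outside: np.ndarray) -> float:
    def minmax(p):
        span = p.max() - p.min()
        return (p - p.min()) / span if span else np.zeros_like(p)
    a, b = minmax(inside), minmax(outside)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Example with two hypothetical 6-band mean profiles for one acquisition date.
inside = np.array([0.05, 0.08, 0.07, 0.30, 0.22, 0.15])
outside = np.array([0.04, 0.07, 0.06, 0.35, 0.30, 0.18])
print(scaled_rmse(inside, outside))
```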
We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested; in this case, the accumulated values would tilt upwards. However, if a mining operation was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.
To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC): an elbow is characterized by a convex time-series shape, which makes the AUC greater than 50%, whereas a concave curve makes the knee the more adequate metric. We limited the detection of shifts to time series with at least 100 time steps; below this threshold, we assumed the mine (or the conditions to sustain it) was present since 1990.
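A rough Python sketch of this detection step, assuming kneebow's Rotor interface and an AUC rule like the one described above; the thresholds and data handling here are illustrative, not the exact study code:

```python
# Illustrative shift detection on a cumulative RMSE series using kneebow.
# The AUC rule (convex -> elbow, concave -> knee) follows the description above;
# exact thresholds and data handling in the study may differ.
import numpy as np
from kneebow.rotor import Rotor

def first_mining_index(rmse: np.ndarray) -> int:
    cum = np.cumsum(rmse)                       # smooth noise, keep directional shifts
    xy = np.column_stack([np.arange(len(cum)), cum])

    # Normalized AUC of the cumulative curve: > 0.5 suggests a convex shape (elbow).
    auc = np.trapz(cum / cum.max()) / (len(cum) - 1)

    rotor = Rotor()
    rotor.fit_rotate(xy)
    return rotor.get_elbow_index() if auc > 0.5 else rotor.get_knee_index()

rmse_series = np.concatenate([np.full(60, 0.05), np.full(60, 0.25)])  # synthetic upward shift
print(first_mining_index(rmse_series))
```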
Content
This repository contains the infrastructure used to infer the start of a mining operation, organized as follows:
00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.
01_analysis - Contains several outputs of our analysis:
xy.tar.gz - Sample locations for each mining site.
sr.tar.gz - Spectral profiles for each sample location.
mine_start.csv - First year when we detected the start of mining.
02_code - Includes all code used in our analysis.
requirements.txt - Python module requirements that can be fed to pip to replicate our study.
config.yml - Configuration file, including information on the Landsat products used.
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstrations of key preprocessing techniques such as missing-value imputation, duplicate removal, outlier detection and treatment, and feature scaling, reflecting the issues introduced above. The columns are:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, …) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0–100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Prediction task: Test_Score → predict the student's test score from study hours, attendance percentage, age, and gender.
🧠 Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'] y = ['Test_Score']
You can use standard regression models for this task and analyze feature influence using correlation or SHAP/LIME explainability.
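A hedged end-to-end sketch of the exercise, assuming a CSV export with the columns listed in the table; the file name student_performance.csv and some preprocessing choices are assumptions:

```python
# Illustrative preprocessing + regression pipeline for the synthetic student data.
# File name and preprocessing choices are assumptions; adapt to the actual export.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("student_performance.csv").drop_duplicates(subset="Student_ID")

# Clip obvious Study_Hours outliers (e.g., negative or > 24 h/day) before modelling.
df["Study_Hours"] = df["Study_Hours"].clip(lower=0, upper=24)

X = df[["Age", "Gender", "Study_Hours", "Attendance(%)"]]
y = df["Test_Score"]

prep = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age", "Study_Hours", "Attendance(%)"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])
model = Pipeline([("prep", prep), ("reg", RandomForestRegressor(random_state=0))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))
```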
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Performance of pGenN & GenNorm on use case data set.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This dataset consists of code-mixed multilingual text data designed for sentiment analysis research. It captures naturally occurring code-mixed patterns combining English with ten Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset aims to support studies in multilingual NLP, sentiment classification, and language processing for real-world social media and conversational data.
Dataset description. The dataset contains the following attributes:
• Text: the original code-mixed text sample.
• Sentiment: the corresponding sentiment label (positive, negative, or neutral).
• Translated_text: English translation of the original text.
• Cleaned_text: text after preprocessing, including lowercasing, punctuation and stopword removal, and normalization.
• Tokens: tokenized representation of the cleaned text.
Preprocessing involved cleaning (removal of punctuation, URLs, and emojis), normalization of repeated characters, language-specific stopword removal, translation to English, and token formation for downstream NLP tasks.
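A rough sketch of cleaning steps of this kind; the regex patterns and repeated-character rule below are illustrative, not the dataset's exact pipeline, and stopword removal and translation are omitted because they are language-specific:

```python
# Illustrative text cleaning similar to the steps described above:
# strip URLs, punctuation, and emojis, collapse repeated characters, then tokenize.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_RE = re.compile(r"[^\w\s]", flags=re.UNICODE)   # drops punctuation and most emojis
REPEAT_RE = re.compile(r"(.)\1{2,}")                      # "soooo" -> "soo"

def clean_text(text: str) -> list[str]:
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return text.split()

print(clean_text("Movie was soooo mast yaar!! 🔥 https://t.co/xyz"))
# ['movie', 'was', 'soo', 'mast', 'yaar']
```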
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Note: Data (V1, V2) are mean brain volumes after normalization to the supratentorial volume with a scale factor of 1,000 (no unit). **Calculated with a two-sample t test to obtain the original p-value (shown with P
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This dataset provides a comprehensive view of student performance and learning behavior, integrating academic, demographic, behavioral, and psychological factors.
It was created by merging two publicly available Kaggle datasets, resulting in a unified dataset of 14,003 student records with 16 attributes. All entries are anonymized, with no personally identifiable information.
Attributes include StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions, Resources, Internet, EduTech, Motivation, StressLevel, Gender, Age (18–30 years), LearningStyle, ExamScore, and FinalGrade.
The dataset can be used for predicting performance outcomes (ExamScore, FinalGrade), among other tasks. It was analyzed in Python, including clustering of LearningStyle categories and extracting insights for adaptive learning.
merged_dataset.csv → 14,003 rows × 16 columns, covering student demographics, behaviors, engagement, learning styles, and performance indicators.
This dataset is an excellent playground for educational data mining — from clustering and behavioral analytics to predictive modeling and personalized learning applications.
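As a hedged starting point for the clustering use mentioned above; the feature subset, the assumption that these columns are numeric, and k = 4 are illustrative choices, not part of the dataset description:

```python
# Illustrative clustering of behavioral/engagement attributes from merged_dataset.csv.
# Assumes the selected columns are numeric; the feature subset and k=4 are
# demonstration choices only.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("merged_dataset.csv")
features = ["StudyHours", "Attendance", "AssignmentCompletion", "Motivation", "StressLevel"]

X = StandardScaler().fit_transform(df[features].dropna())
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(pd.Series(clusters).value_counts())
```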
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Predicting student performance is crucial for providing personalized support and enhancing academic outcomes. Advanced machine-learning approaches are being used to understand student performance variables as educational data grows. A large dataset from several Chinese institutions and high schools is used to develop a credible student performance prediction technique. The dataset includes 80 features and 200,000 records and consequently represents one of the most extensive data collections available for educational research. Initially, the data is passed through preprocessing to address outliers and missing values. In addition, we developed a novel hybrid feature selection model that combines correlation filtering, mutual information, Recursive Feature Elimination (RFE) with Cross-Validation (CV), and stability selection to identify the most impactful features. This study then develops the proposed EffiXNet, a refined version of EfficientNet augmented with self-attention mechanisms, dynamic convolutions, improved normalization methods, and the Sparrow Search Optimization Algorithm for hyperparameter optimization. The developed model was tested using an 80/20 train-test split, where 160,000 records were used for training and 40,000 for testing. The results reported, including accuracy, precision, recall, and F1-score, are based on the full test dataset; for better visualization, the confusion matrices display only a representative subset of test results. Furthermore, EffiXNet achieved an AUC of 0.99, a 25% reduction in logarithmic loss relative to the baseline models, a precision of 97.8%, an F1-score of 98.1%, and reliable optimization of memory usage. Significantly, the developed model showed a consistently high performance level across various metrics, indicating that it is proficient in capturing intricate data patterns. The key insights the current research provides are the necessity of early intervention and directed training support in the educational domain. The EffiXNet framework offers a robust, scalable, and efficient solution for predicting student performance, with potential applications in academic institutions worldwide.
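The hybrid feature-selection idea (correlation filtering, mutual information, cross-validated RFE, stability selection) can be sketched with scikit-learn along the following lines; this is a generic illustration on synthetic data, not the authors' implementation:

```python
# Generic sketch of a hybrid feature-selection stage like the one described:
# 1) drop one of each highly correlated pair, 2) rank by mutual information,
# 3) refine with cross-validated RFE. Stability selection (repeating the
# procedure on resamples and keeping consistently selected features) is omitted.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=80, n_informative=15, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(80)])

# 1) Correlation filtering: drop the second feature of any pair with |r| > 0.9.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

# 2) Mutual information: keep the 30 most informative remaining features.
mi = mutual_info_classif(X, y, random_state=0)
X = X[X.columns[np.argsort(mi)[::-1][:30]]]

# 3) Cross-validated recursive feature elimination.
rfecv = RFECV(LogisticRegression(max_iter=2000), step=1, cv=5).fit(X, y)
print("selected features:", list(X.columns[rfecv.support_]))
```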
License: ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/ (license information derived automatically)
This dataset encompasses comprehensive air quality measurements collected over several months, focusing on various pollutants. It is intended for use in predictive modeling and data analysis within the fields of environmental science and public health. The data offers valuable insights into the concentration levels of different gases, making it suitable for both regression and classification tasks in machine learning applications.
| Feature | Description |
|---|---|
| Date | The date of the measurement. |
| Time | The time of the measurement. |
| CO(GT) | Concentration of carbon monoxide (CO) in the air (µg/m³). |
| PT08.S1(CO) | Sensor measurement for CO concentration. |
| NMHC(GT) | Concentration of non-methane hydrocarbons (NMHC) (µg/m³). |
| C6H6(GT) | Concentration of benzene (C6H6) in the air (µg/m³). |
| PT08.S2(NMHC) | Sensor measurement for NMHC concentration. |
| NOx(GT) | Concentration of nitrogen oxides (NOx) in the air (µg/m³). |
| PT08.S3(NOx) | Sensor measurement for NOx concentration. |
| NO2(GT) | Concentration of nitrogen dioxide (NO2) in the air (µg/m³). |
The dataset includes frequency distributions for each feature, categorized into specified ranges, along with key summary statistics.
This dataset is publicly available for research purposes. If you use this dataset, please cite it as follows:
[Insert citation details based on the original source of the dataset].
Created by: [Include authors or organizations responsible for the dataset].
The dataset has been utilized in numerous studies focusing on air quality analysis and its implications for public health. It serves as a foundational resource for applying various data mining techniques to explore pollutant concentrations and their correlations with health outcomes.
The dataset features temporal measurements related to air quality, enabling the assessment of pollution trends over time. It can be leveraged for both classification and regression tasks, with a focus on data normalization and strategies for handling missing values.
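A hedged sketch of the kind of workflow mentioned above; the file name, the missing-value sentinel, the feature subset, and the target choice are assumptions (the classic UCI Air Quality data, for example, marks missing readings with -200, but this copy may differ):

```python
# Illustrative regression workflow: handle missing values, normalize features,
# and predict benzene concentration from the remaining channels.
# File name, the -200 sentinel, feature subset, and target column are assumptions.
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("air_quality.csv")
df = df.replace(-200, pd.NA)                      # common sentinel for missing readings

features = ["PT08.S1(CO)", "PT08.S2(NMHC)", "PT08.S3(NOx)", "NOx(GT)", "NO2(GT)"]
target = "C6H6(GT)"

data = df[features + [target]].apply(pd.to_numeric, errors="coerce")
data = data.dropna(subset=[target])               # keep rows where the target is present
X = data[features].fillna(data[features].median())
y = data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), Ridge()).fit(X_train, y_train)
print("R^2:", round(model.score(X_test, y_test), 3))
```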
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
The study commenced with a questionnaire survey, which yielded a total of 1,742 initial demands from the user group. Subsequently, 231 invalid sample data were eliminated, and 1,511 valid data were obtained. To further enhance the quality of the sample, 120 users were selected for field household research and in-depth interviews using a random sampling method. One-on-one structured interviews were conducted with the 120 participants between 10 May 2022 and 25 June 2024, and the content of the interviews was adapted to align with the participants' preferences. Each interview lasted approximately 180 minutes and took place approximately two weeks after the field study. The study yielded a final sample of 120 kitchen space needs explorations, with each participant's statements being coded. In order to safeguard the anonymity of the interviewees and to adhere to data protection regulations, only the 1,000 coded data points, excluding the interviews, are presented in this report. These data are displayed in the form of a statistical table entitled 'Coded Data from the Research Sample'.

In the data processing and analysis stages, the Epistemic Network Analysis (ENA) Web Tool (version 1.7.0) was employed for the processing and analysis of the coded data. The length of the sliding window was set to six lines, comprising the current line and the preceding five lines; that is, the co-occurrence of requisite elements is calculated over each six adjacent interview data lines. An adjacency matrix is constructed, and the resulting adjacency vectors are subsequently accumulated. To accommodate the potential discrepancy in the number of data coding rows across different analysis units, all network data are normalized prior to dimension reduction. The singular value decomposition (SVD) method is employed to generate orthogonal dimensions, thereby maximising the variance explained by each dimension. The final map of the kitchen space demand network model can be seen in Figures 3 to 6 of the paper.
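A compact sketch of the normalization-then-SVD step on accumulated co-occurrence vectors, using synthetic data; the ENA Web Tool's exact procedure (e.g., its specific normalization) is not reproduced here:

```python
# Illustrative dimension reduction on accumulated co-occurrence (adjacency) vectors:
# normalize each unit's vector to unit length, centre, then project with SVD.
# This mimics the described normalization + SVD step, not the ENA tool itself.
import numpy as np

rng = np.random.default_rng(0)
units = rng.poisson(lam=3, size=(120, 21)).astype(float)   # 120 units x 21 code pairs (synthetic)

norms = np.linalg.norm(units, axis=1, keepdims=True)
normalized = np.divide(units, norms, out=np.zeros_like(units), where=norms > 0)

centered = normalized - normalized.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ Vt[:2].T                            # unit positions on the first two dimensions
explained = (S**2) / (S**2).sum()
print("variance explained by first two dimensions:", explained[:2].round(3))
```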