17 datasets found
  1. Normalization methods and their properties.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1 more
    xls
    Updated Jun 3, 2023
    Cite
    Sofie Van Landeghem; Jari Björne; Chih-Hsuan Wei; Kai Hakala; Sampo Pyysalo; Sophia Ananiadou; Hung-Yu Kao; Zhiyong Lu; Tapio Salakoski; Yves Van de Peer; Filip Ginter (2023). Normalization methods and their properties. [Dataset]. http://doi.org/10.1371/journal.pone.0055814.t002
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Sofie Van Landeghem; Jari Björne; Chih-Hsuan Wei; Kai Hakala; Sampo Pyysalo; Sophia Ananiadou; Hung-Yu Kao; Zhiyong Lu; Tapio Salakoski; Yves Van de Peer; Filip Ginter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The different normalization methods applied in this study, and whether or not they account for lexical variation, synonymy, orthology and species-specific resolution. By creating combinations of these algorithms, their individual strengths can be aggregated.

  2. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions; the mined facts yield the arguments that shape the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database, itself the product of multiple iterations of data gathering and collection from the inspected literature. The process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features or attributes found in the repository. In effect, we manually engineered features from the DL4SE papers; examples include venue, year published, type of paper, metrics, data scale, type of tuning, learning algorithm, and SE data.

    Preprocessing. Preprocessing consisted of casting the features to the correct (nominal) type, removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to recover information lost in the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” covers unconventional metrics found during extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes supports better understanding and classification of the papers by the data mining tasks.

    Transformation. In this stage we applied no data transformation except for the clustering analysis, where we performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization. PCA also helped identify the number of clusters exhibiting the maximum reduction in variance, i.e., the number of clusters to use when tuning the explainable models.
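
    For illustration only, a minimal sketch of that reduction step (file name and encoding choices are hypothetical, not the authors' pipeline; `sparse_output` requires scikit-learn >= 1.2):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

papers = pd.read_csv("dl4se_papers.csv")   # hypothetical export of the 35 nominal features
encoded = OneHotEncoder(sparse_output=False).fit_transform(papers)  # nominal -> numeric
pca = PCA(n_components=2)
components = pca.fit_transform(encoded)    # 2-D coordinates for visualization
print(pca.explained_variance_ratio_)       # variance captured by each component
```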

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. The goal of the KDD process was to uncover hidden relationships among the extracted features (correlations and association rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used the KDD outcomes to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by a reasoning process over the data mining outcomes, which produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = the number of occurrences where the statement holds, divided by the total number of statements.
    Confidence = the support of the statement divided by the number of occurrences of the premise.
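
    A small sketch of these two definitions on a boolean feature table (column names are hypothetical, mirroring the Supervised Learning example above):

```python
import pandas as pd

def support(df, statement):
    """Fraction of rows (papers) for which the statement holds."""
    return statement(df).mean()

def confidence(df, premise, conclusion):
    """support(premise AND conclusion) divided by support(premise)."""
    return (premise(df) & conclusion(df)).mean() / premise(df).mean()

df = pd.DataFrame({"supervised":     [1, 1, 1, 0],
                   "irreproducible": [1, 1, 0, 0]}).astype(bool)
print(confidence(df, lambda d: d["supervised"], lambda d: d["irreproducible"]))  # 2/3
```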

  3. Figure 4 from manuscript Sparsely-Connected Autoencoder (SCA) for single...

    • figshare.com
    zip
    Updated Aug 26, 2020
    Cite
    Raffaele Calogero (2020). Figure 4 from manuscript Sparsely-Connected Autoencoder (SCA) for single cell RNAseq data mining [Dataset]. http://doi.org/10.6084/m9.figshare.12866717.v1
    Available download formats: zip
    Dataset updated
    Aug 26, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Raffaele Calogero
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset used to generate Figure 4: QCM/QCC plots using different normalizations for the SCA input counts table.
    A) Log10 transformed (figure4/setA/Results/setAMIRNA_SIMLR/5/setA_StabilitySignificativityJittered.pdf)
    B) Centred log-ratio normalization (CLR) (figure4/setA/Results/CLR_FNMIRNA_SIMLR/5/normalized_CLR_FN_StabilitySignificativityJittered.pdf)
    C) Relative log-expression (RLE) (figure4/setA/Results/DESEQ_FNMIRNA_SIMLR/5/normalized_DESEQ_FN_StabilitySignificativityJittered.pdf)
    D) Full-quantile normalization (FQ) (figure4/setA/Results/FQ_FNMIRNA_SIMLR/5/normalized_FQ_FN_StabilitySignificativityJittered.pdf)
    E) Sum scaling normalization (SUM) (/figure4/setA/Results/SUM_FNMIRNA_SIMLR/5/normalized_SUM_FN_StabilitySignificativityJittered.pdf)
    F) Weighted trimmed mean of M-values (TMM) (figure4/setA/Results/TMM_FNMIRNA_SIMLR/5/normalized_TMM_FN_StabilitySignificativityJittered.pdf)

  4. URL-Phish: A Feature-Engineered Dataset for Phishing Detection

    • data.mendeley.com
    Updated Sep 29, 2025
    Cite
    Linh Dam Minh (2025). URL-Phish: A Feature-Engineered Dataset for Phishing Detection [Dataset]. http://doi.org/10.17632/65z9twcx3r.1
    Dataset updated
    Sep 29, 2025
    Authors
    Linh Dam Minh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset, named URL-Phish, is designed for phishing detection research. It contains 111,660 unique URLs divided into:

    • 100,000 benign samples (label = 0), collected from trusted sources including educational (.edu), governmental (.gov), and top-ranked domains; the benign set was obtained from the Research Organization Registry [1].
    • 11,660 phishing samples (label = 1), obtained from the PhishTank repository [2] between November 2024 and September 2025.

    Each URL was automatically processed to extract 22 lexical and structural features, such as URL length, domain length, number of subdomains, digit ratio, entropy, and HTTPS usage. Three reference columns (url, dom, tld) are preserved for interpretability, and one label column is included (0 = benign, 1 = phishing). A data cleaning step removed duplicates and empty entries, followed by feature normalization to ensure consistency. The dataset is provided in CSV format with 22 numerical feature columns, 3 string reference columns, and 1 label column (26 columns in total).
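
    A sketch of how a few of the 22 features could be computed from a raw URL (the names and exact definitions here are assumptions for illustration, not the dataset's extraction code):

```python
import math
from collections import Counter
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    parsed = urlparse(url)
    host = parsed.netloc
    counts = Counter(url)
    entropy = -sum((n / len(url)) * math.log2(n / len(url)) for n in counts.values())
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "num_subdomains": max(host.count(".") - 1, 0),  # crude: ignores multi-part TLDs
        "digit_ratio": sum(ch.isdigit() for ch in url) / len(url),
        "entropy": entropy,                              # Shannon entropy over characters
        "https": int(parsed.scheme == "https"),
    }

print(url_features("https://login.example.com/verify?id=12345"))
```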

    References
    [1] Research Organization Registry, “ROR Data”, Zenodo, Sept. 22, 2025. doi: 10.5281/ZENODO.6347574.
    [2] PhishTank, “PhishTank: Join the fight against phishing”. [Online]. Available: https://phishtank.org

  5. Data from: Data and systems for medication-related text classification and...

    • data.mendeley.com
    • commons.datacite.org
    Updated Jul 16, 2018
    + more versions
    Cite
    Abeed Sarker (2018). Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H) 2017 shared task [Dataset]. http://doi.org/10.17632/rxwfb3tysd.1
    Dataset updated
    Jul 16, 2018
    Authors
    Abeed Sarker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data accompanies the following publication:

    Title: Data and systems for medication-related text classification and concept normalization from Twitter: Insights from the Social Media Mining for Health (SMM4H) 2017 shared task

    Journal: Journal of the American Medical Informatics Association (JAMIA)

    The evaluation data (in addition to the training data) was used for the SMM4H-2017 shared tasks, co-located with AMIA-2017 (Washington DC).

  6. Data S1 - Large-Scale Event Extraction from Literature with Multi-Level Gene...

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Sofie Van Landeghem; Jari Björne; Chih-Hsuan Wei; Kai Hakala; Sampo Pyysalo; Sophia Ananiadou; Hung-Yu Kao; Zhiyong Lu; Tapio Salakoski; Yves Van de Peer; Filip Ginter (2023). Data S1 - Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization [Dataset]. http://doi.org/10.1371/journal.pone.0055814.s001
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Sofie Van Landeghem; Jari Björne; Chih-Hsuan Wei; Kai Hakala; Sampo Pyysalo; Sophia Ananiadou; Hung-Yu Kao; Zhiyong Lu; Tapio Salakoski; Yves Van de Peer; Filip Ginter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This file provides additional details on the pathway curation use-case, which describes a subsection of the human p53 signaling pathway. In this supplemental file, the data on the full p53 pathway are also provided. (XLS)

  7. Iowa In-Season Crop Type Map (2020) - Dataset - NASA Harvest Portal

    • data.harvestportal.org
    Updated Nov 25, 2020
    Cite
    (2020). Iowa In-Season Crop Type Map (2020) - Dataset - NASA Harvest Portal [Dataset]. https://data.harvestportal.org/dataset/iowa-in-season-crop-type-map-2020
    Dataset updated
    Nov 25, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Iowa
    Description

    This map is an in-season crop type map for Iowa commodity crops, predicted from Harmonized Landsat and Sentinel-2 (HLS) observations through August 2020 using the machine learning method described in Kerner et al., 2020 [1]. Each 30 m pixel is labeled corn (0), soybean (1), or other (2). The "other" class includes crops that are not corn or soybean as well as other land cover types (i.e., "other" means the pixel is not corn or soybean).

    [1] Kerner, H. R., Sahajpal, R., Skakun, S., Becker-Reshef, I., Barker, B., Hosseini, M. (2020). Resilient In-Season Crop Type Classification in Multispectral Satellite Observations using Growth Stage Normalization. ACM SIGKDD Conference on Knowledge Discovery and Data Mining Workshops. https://arxiv.org/abs/2009.10189

  8. Wine Quality (UCI) — Clean CSV for ML

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Nishtha Sharma (2025). Wine Quality (UCI) — Clean CSV for ML [Dataset]. https://www.kaggle.com/datasets/nishtha711/wine-quality-uci-clean-csv-for-ml
    Available download formats: zip (10711 bytes)
    Dataset updated
    Aug 1, 2025
    Authors
    Nishtha Sharma
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🍷 Wine Quality Dataset — Cleaned + Raw ZIP

    This dataset contains the classic Wine Recognition Dataset from the UCI Machine Learning Repository — now presented in two formats:

    • ✅ A cleaned and labeled CSV (wine_clean.csv) for fast ML workflows
    • 🗜️ The original UCI zip (wine_original.zip) for purists and explorers

    Perfect for learning K-Nearest Neighbors (KNN), exploring distance metrics like Euclidean, Manhattan, Cosine, and building visual + interactive ML notebooks.
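
    As a quick illustration, a minimal KNN sketch comparing the three distance metrics (assumes wine_clean.csv and the Class column described below; hyperparameters are arbitrary):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("wine_clean.csv")
X, y = df.drop(columns="Class"), df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

scaler = StandardScaler().fit(X_tr)          # distance metrics need comparable scales
for metric in ("euclidean", "manhattan", "cosine"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(scaler.transform(X_tr), y_tr)
    print(metric, knn.score(scaler.transform(X_te), y_te))
```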

    📂 What's Inside?

    • wine_clean.csv: clean version with column names, no missing data, and ready to use
    • wine.zip: raw UCI files (wine.data, wine.names, etc.) for reference or manual parsing

    🧪 Features in Clean CSV

    • Class: target, the cultivar of wine (1, 2, or 3)
    • Alcohol: alcohol content
    • Malic_Acid: malic acid amount
    • Ash: ash content
    • Alcalinity_of_Ash: alkalinity of ash
    • Magnesium: magnesium content
    • Total_Phenols: total phenol compounds
    • Flavanoids: flavonoid concentration
    • Nonflavanoid_Phenols: non-flavonoid phenols
    • Proanthocyanins: amount of proanthocyanins
    • Color_Intensity: intensity of wine color
    • Hue: hue of wine
    • OD280_OD315: optical density ratio
    • Proline: proline levels

    📈 Why Use This Dataset?

    • Great for learning classification with distance-based algorithms (e.g., KNN)
    • Use for visualizations, feature scaling, normalization demos
    • Combines clean, beginner-friendly data with original reference files
    • Ideal for building educational projects, GitHub portfolios, and Streamlit apps

    🧠 Origin & Credit

    • 📍 Source: UCI Machine Learning Repository
    • 🧪 Collected by: Paulo Cortez et al., University of Minho, Portugal
    • 📝 Citation:
      > Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J. (2009).
      > Modeling wine preferences by data mining from physicochemical properties.
      > Decision Support Systems, Elsevier. DOI: 10.24432/C56S3T

    🔖 License

    Public Domain (CC0) — free to use, remix, and share 🌍

    If you're an ML student or early-career data scientist, this dataset is your 🍷 playpen. Dive in!

  9. Onset of mining operations

    • data-staging.niaid.nih.gov
    Updated Mar 17, 2024
    Cite
    Remelgado, Ruben; Meyer, Carsten (2024). Onset of mining operations [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8214548
    Dataset updated
    Mar 17, 2024
    Dataset provided by
    German Centre for Integrative Biodiversity Research (iDiv)
    Authors
    Remelgado, Ruben; Meyer, Carsten
    Description

    Motivation

    Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is imperative for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet the dataset is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.

    Approach

    For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.

    After completing the extraction, we estimated mean spectral profiles for each acquisition date: one for the samples “inside” the mining area and another for those “outside” of it. In this process, we masked pixels affected by clouds and cloud shadows using Landsat's quality information.

    Using the time series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step emphasizes differences in the shape of the spectral profiles rather than in their specific values, which can stem from radiometric inaccuracies or simply from differences in acquisition dates. This resulted in an RMSE time series for each mining site.
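
    A toy version of this comparison step (the z-score normalization is an assumption; the dataset does not publish the exact formula here):

```python
import numpy as np

def profile_rmse(inside: np.ndarray, outside: np.ndarray) -> float:
    """RMSE between normalized mean spectral profiles for one acquisition date."""
    norm = lambda p: (p - p.mean()) / p.std()   # compare profile shape, not magnitude
    return float(np.sqrt(np.mean((norm(inside) - norm(outside)) ** 2)))

rng = np.random.default_rng(0)
inside = rng.random((200, 6))                   # 200 dates x 6 bands (dummy values)
outside = rng.random((200, 6))
rmse_series = [profile_rmse(i, o) for i, o in zip(inside, outside)]
```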

    We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time series, with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees while the outskirts of the mine remained forested; here the accumulated values tilt upwards. However, if a mining exploration was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.

    To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of the two metrics was the most adequate, we used the Area Under the Curve (AUC): an elbow is characterized by a convex time series, which makes the AUC greater than 50%, whereas a concave curve calls for the knee. We limited the detection of shifts to time series with at least 100 time steps; below this threshold, we assumed the mine (or the conditions to sustain it) was present since 1990.
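
    A toy sketch of that detection rule, assuming kneebow's Rotor API (pip install kneebow) and a synthetic RMSE series with an upward shift; the AUC threshold follows the rule stated above:

```python
import numpy as np
from kneebow.rotor import Rotor

t = np.arange(200)                                    # synthetic series: shift at t = 120
rmse_series = np.where(t < 120, 0.1, 0.4) + np.random.default_rng(0).normal(0, 0.02, 200)
cum_rmse = np.cumsum(rmse_series)                     # denoise, keep directional changes

rotor = Rotor()
rotor.fit_rotate(np.column_stack([t, cum_rmse]))
auc = cum_rmse.mean() / cum_rmse.max()                # fraction of the bounding box
idx = rotor.get_elbow_index() if auc > 0.5 else rotor.get_knee_index()
print("first detectable change near time step", idx)  # expected: around 120
```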

    Content

    This repository contains the infrastructure used to infer the start of a mining operation, organized as follows:

    00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.

    01_analysis - Contains several outputs of our analysis:

    xy.tar.gz - Sample locations for each mining site.

    sr.tar.gz - Spectral profiles for each sample location.

    mine_start.csv - First year when we detected the start of mining.

    02_code - Includes all code used in our analysis.

    requirements.txt - Python module requirements that can be fed to pip to replicate our study.

    config.yml - Configuration file, including information on the Landsat products used.

  10. Student Academic Performance (Synthetic Dataset)

    • kaggle.com
    zip
    Updated Oct 10, 2025
    Cite
    Mamun Hasan (2025). Student Academic Performance (Synthetic Dataset) [Dataset]. https://www.kaggle.com/datasets/mamunhasan2cs/student-academic-performance-synthetic-dataset
    Available download formats: zip (9287 bytes)
    Dataset updated
    Oct 10, 2025
    Authors
    Mamun Hasan
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.

    The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:

    • Handling missing values
    • Removing duplicates
    • Detecting and treating outliers
    • Data normalization and transformation
    • Encoding categorical variables
    • Exploratory data analysis (EDA)
    • Regression Analysis

    📊 Columns Description

    • Student_ID: unique identifier for each student (e.g., S0001, S0002, …)
    • Age: age of the student (between 18 and 25 years)
    • Gender: gender of the student (Male/Female)
    • Study_Hours: average number of study hours per day (contains missing values and outliers)
    • Attendance(%): percentage of class attendance (contains missing values)
    • Test_Score: final exam score (0–100 scale)
    • Grade: letter grade derived from test scores (F, C, B, A, A+)

    🧠 Example Lab Tasks Using This Dataset:

    • Identify and impute missing values using the mean/median.
    • Detect and remove duplicate records.
    • Use IQR or Z-score methods to handle outliers.
    • Normalize Study_Hours and Test_Score using Min-Max scaling.
    • Encode categorical variables (Gender, Grade) for model input.
    • Prepare a clean dataset ready for classification/regression analysis.
    • Use the cleaned data for simple regression exercises (see the sketch after this list).
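
    A minimal pass over these tasks (column names follow the table above; the specific cleaning choices are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv("student_performance.csv")            # hypothetical file name

df["Study_Hours"] = df["Study_Hours"].fillna(df["Study_Hours"].median())    # impute
df["Attendance(%)"] = df["Attendance(%)"].fillna(df["Attendance(%)"].mean())
df = df.drop_duplicates(subset="Student_ID")           # remove duplicate records

q1, q3 = df["Study_Hours"].quantile([0.25, 0.75])      # IQR rule for outliers
iqr = q3 - q1
df = df[df["Study_Hours"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

for col in ("Study_Hours", "Test_Score"):              # Min-Max scaling to [0, 1]
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})   # encode a categorical
```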

    🎯 Possible Regression Targets

    Test_Score → Predict test score based on study hours, attendance, age, and gender.

    🧩 Example Regression Problem

    Predict the student’s test score using their study hours, attendance percentage, and age.

    🧠 Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'] y = ['Test_Score']

    You can use:

    • Linear Regression (for simplicity)
    • Polynomial Regression (to explore nonlinear patterns)
    • Decision Tree Regressor or Random Forest Regressor

    And analyze feature influence using correlation or SHAP/LIME explainability.
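
    A short sketch of the regression task above (file name and split are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("student_performance.csv").dropna()   # hypothetical file name
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

X = df[["Age", "Gender", "Study_Hours", "Attendance(%)"]]
y = df["Test_Score"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
print("R^2 on held-out students:", model.score(X_te, y_te))
```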

  11. Performance of pGenN & GenNorm on use case data set.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Ruoyao Ding; Cecilia N. Arighi; Jung-Youn Lee; Cathy H. Wu; K. Vijay-Shanker (2023). Performance of pGenN & GenNorm on use case data set. [Dataset]. http://doi.org/10.1371/journal.pone.0135305.t009
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ruoyao Ding; Cecilia N. Arighi; Jung-Youn Lee; Cathy H. Wu; K. Vijay-Shanker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of pGenN & GenNorm on use case data set.

  12. Code-Mixed Indic Languages with Emoticons for Sarcasm Detection

    • data.mendeley.com
    Updated Oct 10, 2025
    Cite
    Sarah Shaikhh (2025). Code-Mixed Indic Languages with Emoticons for Sarcasm Detection [Dataset]. http://doi.org/10.17632/bdm2y2p3rc.1
    Dataset updated
    Oct 10, 2025
    Authors
    Sarah Shaikhh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of code-mixed multilingual text data designed for sentiment analysis research. It captures naturally occurring code-mixed patterns combining English with ten Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset aims to support studies in multilingual NLP, sentiment classification, and language processing for real-world social media and conversational data.

    Dataset Description. The dataset contains the following attributes:

    • Text: the original code-mixed text sample.
    • Sentiment: the corresponding sentiment label (positive, negative, or neutral).
    • Translated_text: English translation of the original text.
    • Cleaned_text: text after preprocessing, including lowercasing, punctuation and stopword removal, and normalization.
    • Tokens: tokenized representation of the cleaned text.

    Preprocessing involved cleaning (removal of punctuation, URLs, and emojis), normalization of repeated characters, language-specific stopword removal, translation to English, and token formation for downstream NLP tasks.
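
    A rough sketch of the cleaning steps described above (the regexes are assumptions; the dataset's own pipeline is not distributed with the data):

```python
import re

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)       # strip URLs
    text = re.sub(r"[^\w\s]", "", text)            # strip punctuation and emojis
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)    # normalize repeats: sooooo -> soo
    return text.strip()

sample = "Sooooo good yaar!!! https://t.co/xyz"
print(clean_text(sample).split())                  # ['soo', 'good', 'yaar']
```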

  13. Significant regional volume differences (P

    • plos.figshare.com
    xls
    Updated Jun 7, 2023
    Cite
    Yongxia Zhou; Fang Yu; Timothy Duong (2023). Significant regional volume differences (P [Dataset]. http://doi.org/10.1371/journal.pone.0090405.t001
    Available download formats: xls
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yongxia Zhou; Fang Yu; Timothy Duong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Note: data (V1, V2) are mean brain volumes after normalization to the supratentorial volume with a scale factor of 1000 (no unit). **Calculated with a two-sample t-test to obtain the original p-value (shown with P

  14. Student Performance and Learning Behavior Dataset

    • kaggle.com
    zip
    Updated Sep 4, 2025
    + more versions
    Cite
    Adil Shamim (2025). Student Performance and Learning Behavior Dataset [Dataset]. https://www.kaggle.com/datasets/adilshamim8/student-performance-and-learning-style
    Available download formats: zip (78897 bytes)
    Dataset updated
    Sep 4, 2025
    Authors
    Adil Shamim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a comprehensive view of student performance and learning behavior, integrating academic, demographic, behavioral, and psychological factors.

    It was created by merging two publicly available Kaggle datasets, resulting in a unified dataset of 14,003 student records with 16 attributes. All entries are anonymized, with no personally identifiable information.

    Key Features

    • Study behaviors & engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions
    • Resources & environment: Resources, Internet, EduTech
    • Motivation & psychology: Motivation, StressLevel
    • Demographics: Gender, Age (18–30 years)
    • Learning preference: LearningStyle
    • Performance indicators: ExamScore, FinalGrade

    Objectives & Use Cases

    The dataset can be used for:

    • Predictive modeling → Regression/classification of student performance (ExamScore, FinalGrade)
    • Clustering analysis → Identifying learning behavior groups with K-Means or other unsupervised methods
    • Educational analytics → Exploring how study habits, stress, and motivation affect outcomes
    • Adaptive learning research → Linking behavioral patterns to personalized learning pathways

    Analysis Pipeline (from original study)

    The dataset was analyzed in Python using:

    • Preprocessing → Encoding, normalization (z-score, Min–Max), deduplication
    • Clustering → K-Means, Elbow Method, Silhouette Score, Davies–Bouldin Index (see the sketch after this list)
    • Dimensionality Reduction → PCA (2D/3D visualizations)
    • Statistical Analysis → ANOVA, regression for group differences
    • Interpretation → Mapping clusters to LearningStyle categories & extracting insights for adaptive learning
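
    A minimal sketch of the clustering stage (the column subset and k range are assumptions, not the original study's code):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("merged_dataset.csv")
X = MinMaxScaler().fit_transform(df.select_dtypes("number"))   # Min-Max normalization

for k in range(2, 8):                                          # sweep cluster counts
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels), davies_bouldin_score(X, labels))

coords = PCA(n_components=2).fit_transform(X)                  # 2-D view of clusters
```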

    File

    • merged_dataset.csv → 14,003 rows × 16 columns, including student demographics, behaviors, engagement, learning styles, and performance indicators.

    Provenance

    This dataset is an excellent playground for educational data mining — from clustering and behavioral analytics to predictive modeling and personalized learning applications.

  15. List of research on educational data mining.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 30, 2025
    Cite
    Meshari Alazmi; Nasir Ayub (2025). List of research on educational data mining. [Dataset]. http://doi.org/10.1371/journal.pone.0326966.t001
    Available download formats: xls
    Dataset updated
    Jun 30, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Meshari Alazmi; Nasir Ayub
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Predicting student performance is crucial for providing personalized support and enhancing academic outcomes. As educational data grows, advanced machine-learning approaches are being used to understand the variables behind student performance. A large dataset from several Chinese institutions and high schools is used to develop a credible student performance prediction technique. The dataset includes 80 features and 200,000 records, making it one of the most extensive data collections available for educational research. Initially, the data passes through preprocessing to address outliers and missing values. We then developed a novel hybrid feature selection model that combines correlation filtering with mutual information, cross-validation (CV) with Recursive Feature Elimination (RFE), and stability selection to identify the most impactful features. This study develops EffiXNet, a refined version of EfficientNet augmented with self-attention mechanisms, dynamic convolutions, improved normalization methods, and the Sparrow Search Optimization Algorithm for hyperparameter optimization. The model was tested using an 80/20 train-test split, with 160,000 records for training and 40,000 for testing. The reported results, including accuracy, precision, recall, and F1-score, are based on the full test dataset; for better visualization, the confusion matrices display only a representative subset of test results. EffiXNet achieved an AUC of 0.99, a 25% reduction in logarithmic loss relative to the baseline models, a precision of 97.8%, an F1-score of 98.1%, and reliable optimization of memory usage. The model showed consistently high performance across metrics, indicating that it captures intricate data patterns well. The key insights of this research are the necessity of early intervention and directed training support in the educational domain. The EffiXNet framework offers a robust, scalable, and efficient solution for predicting student performance, with potential applications in academic institutions worldwide.
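
    For orientation, a sketch of a hybrid feature selection in the spirit described (correlation filter, then mutual information, then RFE with CV); this is illustrative, not the authors' EffiXNet pipeline:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, mutual_info_classif

df = pd.read_csv("student_records.csv")            # hypothetical 80-feature table
X, y = df.drop(columns="outcome"), df["outcome"]   # hypothetical target column

corr = X.corr().abs()                              # 1) drop near-duplicate features
keep = [c for i, c in enumerate(X.columns) if not (corr.iloc[:i][c] > 0.95).any()]
X = X[keep]

mi = mutual_info_classif(X, y, random_state=0)     # 2) keep informative features
X = X.loc[:, mi > mi.mean()]

selector = RFECV(RandomForestClassifier(random_state=0), cv=5)   # 3) recursive elimination
selector.fit(X, y)
print(list(X.columns[selector.support_]))
```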

  16. UCI Air Quality Dataset

    • kaggle.com
    Updated Oct 15, 2024
    Cite
    Daksh Bhalala (2024). UCI Air Quality Dataset [Dataset]. https://www.kaggle.com/datasets/dakshbhalala/uci-air-quality-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 15, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Daksh Bhalala
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Air Quality Measurements Dataset

    Description

    This dataset encompasses comprehensive air quality measurements collected over several months, focusing on various pollutants. It is intended for use in predictive modeling and data analysis within the fields of environmental science and public health. The data offers valuable insights into the concentration levels of different gases, making it suitable for both regression and classification tasks in machine learning applications.

    Features

    • Date: the date of the measurement
    • Time: the time of the measurement
    • CO(GT): concentration of carbon monoxide (CO) in the air (µg/m³)
    • PT08.S1(CO): sensor measurement for CO concentration
    • NMHC(GT): concentration of non-methane hydrocarbons (NMHC) (µg/m³)
    • C6H6(GT): concentration of benzene (C6H6) in the air (µg/m³)
    • PT08.S2(NMHC): sensor measurement for NMHC concentration
    • NOx(GT): concentration of nitrogen oxides (NOx) in the air (µg/m³)
    • PT08.S3(NOx): sensor measurement for NOx concentration
    • NO2(GT): concentration of nitrogen dioxide (NO2) in the air (µg/m³)

    Statistical Overview

    The dataset includes frequency distributions for each feature, categorized into specified ranges. Key statistics include:

    • CO(GT): Values can range significantly, with minimums around -200 µg/m³.
    • NOx(GT): Concentration values span various ranges, with some exceeding 2000 µg/m³.

    Citation Request

    This dataset is publicly available for research purposes. If you use this dataset, please cite it as follows:

    [Insert citation details based on the original source of the dataset].

    Sources

    Created by: [Include authors or organizations responsible for the dataset].

    Past Usage

    The dataset has been utilized in numerous studies focusing on air quality analysis and its implications for public health. It serves as a foundational resource for applying various data mining techniques to explore pollutant concentrations and their correlations with health outcomes.

    Relevant Information

    The dataset features temporal measurements related to air quality, enabling the assessment of pollution trends over time. It can be leveraged for both classification and regression tasks, with a focus on data normalization and strategies for handling missing values.

    Number of Instances

    • Total Records: 951 (across specified time frames)

    Number of Attributes

    • Input Attributes: 10 attributes related to air quality measurements.

    Missing Attribute Values

    • Some measurements may be recorded as -200, indicating missing or invalid data points.
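
    A one-liner illustrates the sentinel handling (file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("AirQualityUCI.csv")     # hypothetical export of this dataset
df = df.replace(-200, pd.NA)              # -200 marks missing/invalid readings
print(df.isna().mean().sort_values(ascending=False))   # missing fraction per column
```
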
  17. User requirements mining methods for scenario design using Quantitative...

    • figshare.com
    xlsx
    Updated Oct 15, 2024
    Cite
    Zhou Meiqi (2024). User requirements mining methods for scenario design using Quantitative Ethnography——data disclosure [Dataset]. http://doi.org/10.6084/m9.figshare.27231864.v1
    Available download formats: xlsx
    Dataset updated
    Oct 15, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Zhou Meiqi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The study commenced with a questionnaire survey, which yielded a total of 1,742 initial demands from the user group. After 231 invalid samples were eliminated, 1,511 valid data points remained. To further enhance sample quality, 120 users were selected by random sampling for field household research and in-depth interviews. One-on-one structured interviews were conducted with the 120 participants between 10 May 2022 and 25 June 2024, with the interview content adapted to each participant's preferences. Each interview lasted approximately 180 minutes and took place roughly two weeks after the field study. The study yielded a final sample of 120 kitchen space needs explorations, with each participant's statements coded. To safeguard the anonymity of the interviewees and adhere to data protection regulations, only the 1,000 coded data points, excluding the interviews, are presented in this report, displayed as a statistical table entitled 'Coded Data from the Research Sample'.

    In the data processing and analysis stages, an Epistemic Network Analysis web tool (version 1.7.0) was employed to process and analyze the coded data. The sliding window length is set to six lines (the current line and the preceding five), meaning the co-occurrence of requisite elements is calculated over each six adjacent interview data lines. An adjacency matrix is constructed, and the resulting adjacency vectors are accumulated. To accommodate differences in the number of coded rows across analysis units, all network data are normalized prior to dimension reduction. The singular value decomposition (SVD) method is employed to generate orthogonal dimensions, thereby maximising the variance explained by each dimension. The final map of the kitchen space demand network model is shown in Figures 3 to 6 of the paper.
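
    A toy reconstruction of this pipeline on dummy codes (window handling, normalization, and projection are simplified relative to the ENA tool):

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(60, 8)).astype(float)   # 60 coded lines x 8 need codes

window, vectors = 6, []
for i in range(len(codes)):
    seg = codes[max(0, i - window + 1): i + 1]           # current line + previous five
    cooc = (seg.T @ seg > 0).astype(float)               # code co-occurrence in window
    vectors.append(cooc[np.triu_indices(8, k=1)])        # upper triangle as a vector

X = np.array(vectors)
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-9)   # normalize rows
X -= X.mean(axis=0)                                      # center, then SVD
U, S, Vt = np.linalg.svd(X, full_matrices=False)
points = U[:, :2] * S[:2]                                # 2-D positions for the map
```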
