78 datasets found
  1. Student Performance Dataset

    • kaggle.com
    Updated Aug 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ghulam Muhammad Nabeel (2025). Student Performance Dataset [Dataset]. https://www.kaggle.com/datasets/nabeelqureshitiii/student-performance-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 27, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Ghulam Muhammad Nabeel
    Description

    📊 Student Performance Dataset (Synthetic, Realistic)

    Overview

    This dataset contains 1000000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.

    Each row represents one student with features like study hours, attendance, class participation, and final score.
    The dataset is small, clean, and structured to be beginner-friendly.

    🔑 Columns Description

    • student_id → Unique identifier for each student.
    • weekly_self_study_hours → Average weekly self-study hours (0–40). Generated using a normal distribution centered around 15 hours.
    • attendance_percentage → Attendance percentage (50–100). Simulated with a normal distribution around 85%.
    • class_participation → Score between 0–10 indicating how actively the student participates in class. Generated from a normal distribution centered around 6.
    • total_score → Final performance score (0–100). Calculated as a function of study hours + random noise, then clipped between 0–100. Stronger correlation with study hours.
    • grade → Categorical label (A, B, C, D, F) derived from total_score.

    📐 Data Generation Logic

    1. Weekly Study Hours: Modeled using a normal distribution (mean ≈ 15, std ≈ 7), capped between 0 and 40 hours.
    2. Scores: More study hours → higher score. Formula:

    Random noise simulates differences in learning ability, motivation, etc.

    1. Attendance & Participation: Independent but realistic variations added.
    2. Grades: Assigned from scores using thresholds:
    • A: ≥ 85
    • B: ≥ 70
    • C: ≥ 55
    • D: ≥ 40
    • F: < 40

    🎯 How to Use This Dataset

    Regression Tasks

    • Predict total_score from weekly_self_study_hours.
    • Train and evaluate Linear Regression models.
    • Extend to multiple regression using attendance_percentage and class_participation.

    Classification Tasks

    • Predict grade (A–F) using study hours, attendance, and participation.

    Model Evaluation Practice

    • Apply train-test split and cross-validation.
    • Evaluate with MAE, RMSE, R².
    • Compare simple vs. multiple regression.

    ✅ This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).

  2. Student Performance Data Set

    • kaggle.com
    zip
    Updated Mar 27, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
    Explore at:
    zip(12353 bytes)Available download formats
    Dataset updated
    Mar 27, 2020
    Authors
    Data-Science Sean
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this Data Set is useful, and upvote is appreciated. This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

  3. i

    UTAUT2 Dataset on Gen Z Students Use of Short Learning Videos

    • ieee-dataport.org
    Updated May 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Khaled Halimi (2025). UTAUT2 Dataset on Gen Z Students Use of Short Learning Videos [Dataset]. https://ieee-dataport.org/documents/utaut2-dataset-gen-z-students-use-short-learning-videos
    Explore at:
    Dataset updated
    May 7, 2025
    Authors
    Khaled Halimi
    Description

    This dataset contains raw survey data collected from 207 Generation Z students at the University of Guelma

  4. Attendance sheet Data set for University

    • kaggle.com
    zip
    Updated May 18, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ahmed Ali (2023). Attendance sheet Data set for University [Dataset]. https://www.kaggle.com/datasets/ahmedaliraja/attendance-sheet-data-set-for-university
    Explore at:
    zip(608 bytes)Available download formats
    Dataset updated
    May 18, 2023
    Authors
    Ahmed Ali
    Description

    Context: The University Attendance Sheet Dataset is a comprehensive collection of attendance records from various university courses. This dataset is valuable for analyzing student attendance patterns, studying the impact of attendance on academic performance, and exploring factors influencing student engagement. It provides a rich resource for researchers, educators, and students interested in understanding attendance dynamics within a university setting.

    Content: The dataset includes the following information:

    Student ID: A unique identifier for each student. Course ID: A unique identifier for each course. Date: The date of the attendance record. Attendance Status: Indicates whether the student was present, absent, or had an excused absence on a particular date. The dataset contains records from multiple academic semesters, covering a wide range of courses across different disciplines. By examining this dataset, researchers can investigate attendance trends across different courses, identify patterns related to student performance, and explore correlations between attendance and other academic variables.

    Acknowledgements: We would like to express our gratitude to the university administration, faculty members, and students who contributed to the collection and organization of this dataset. Their cooperation and support have made this dataset possible, enabling valuable insights into student attendance dynamics.

    Inspiration: The inspiration behind creating this dataset stems from the recognition of the significant role attendance plays in a student's academic journey. By making this dataset available on Kaggle, we hope to facilitate research and analysis on attendance patterns, identify interventions to improve student engagement, and provide educators with valuable insights to enhance their teaching strategies. We also encourage collaboration and exploration of the dataset to uncover new findings and generate knowledge that can benefit the education community as a whole.

    By leveraging the University Attendance Sheet Dataset, we aspire to contribute to the ongoing efforts to improve student success and foster an environment that promotes active participation and learning within higher education institutions.

  5. f

    Detailed characterization of the dataset.

    • figshare.com
    xls
    Updated Sep 26, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Detailed characterization of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t006
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.

  6. d

    School Attendance by Student Group and District, 2022-2023

    • catalog.data.gov
    • data.ct.gov
    • +1more
    Updated Sep 15, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.ct.gov (2023). School Attendance by Student Group and District, 2022-2023 [Dataset]. https://catalog.data.gov/dataset/school-attendance-by-student-group-and-district-2022-2023
    Explore at:
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    data.ct.gov
    Description

    This dataset includes the attendance rate for public school students PK-12 by student group and by district during the 2022-2023 school year. Student groups include: Students experiencing homelessness Students with disabilities Students who qualify for free/reduced lunch English learners All high needs students Non-high needs students Students by race/ethnicity (Hispanic/Latino of any race, Black or African American, White, All other races) Attendance rates are provided for each student group by district and for the state. Students who are considered high needs include students who are English language learners, who receive special education, or who qualify for free and reduced lunch. When no attendance data is displayed in a cell, data have been suppressed to safeguard student confidentiality, or to ensure that statistics based on a very small sample size are not interpreted as equally representative as those based on a sufficiently larger sample size. For more information on CSDE data suppression policies, please visit http://edsight.ct.gov/relatedreports/BDCRE%20Data%20Suppression%20Rules.pdf.

  7. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    csvAvailable download formats
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  8. Students Data Analysis

    • kaggle.com
    zip
    Updated Jul 20, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    MOMONO (2022). Students Data Analysis [Dataset]. https://www.kaggle.com/datasets/erqizhou/students-data-analysis
    Explore at:
    zip(2174 bytes)Available download formats
    Dataset updated
    Jul 20, 2022
    Authors
    MOMONO
    Description

    A little paragraph from one real dataset, with a few little changes to protect students' private information. Permissions are given.

    Goals

    You are going to help teachers with only the data: 1. Prediction: To tell what makes a brilliant student who can apply for a graduate school, whether abroad or not. 2. Application: To help those who fails to apply for a graduate school with advice in job searching.

    Tips

    1. Educational data may have subtle structures, hierarchies and heterogeneity are probably involved. Simple regressions can hardly make any difference. Also, you should keep an eye on the collinearity in some indicators collected by teachers who have already forgot statistics.
    2. Not all students are free to choose to apply for a graduate school, but some were born with privileges.
    3. Some of the students are trying (or planning to try) to apply for a graduate school for years, you should be responsible to give advice accurately under their circumstances

    About the Data

    Some of the original structure are deleted or censored. For those are left: Basic data like: - ID - class: categorical, initially students were divided into 2 classes, yet teachers suspect that of different classes students may performance significant differently. - gender - race: categorical and censored - GPA: real numbers, float

    Some teachers assume that scores of math curriculums can represent one's likelihood perfectly: - Algebra: real numbers, Advanced Algebra - ......

    Some assume that background of students can affect their choices and likelihood significantly, which are all censored as: - from1: students' home locations - from2: a probably bad indicator for preference on mathematics - from 3: how did students apply for this university (undergraduate) - from4: a probably bad indicator for family background. 0 with more wealth, 4 with more poverty

    The final indicator y: - 0, one fails to apply for the graduate school, who may apply again or search jobs in the future - 1, success, inland - 2, success, abroad

  9. h

    books-tabular-dataset

    • huggingface.co
    Updated Sep 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chenrui (2025). books-tabular-dataset [Dataset]. https://huggingface.co/datasets/EricCRX/books-tabular-dataset
    Explore at:
    Dataset updated
    Sep 27, 2025
    Authors
    Chenrui
    Description

    📄 Model Card: Books Tabular Dataset

      1. Purpose
    

    This dataset was created for educational purposes in the context of Homework 1 (Dealing with Data). The goal is to provide a small but structured tabular dataset that allows students to practice working with real-world features, preprocessing, augmentation, and uploading to Hugging Face. The dataset supports tasks such as classification, exploratory data analysis (EDA), and simple modeling.

      2. Composition… See the full description on the dataset page: https://huggingface.co/datasets/EricCRX/books-tabular-dataset.
    
  10. m

    AR-ASAG-Dataset

    • data.mendeley.com
    • kaggle.com
    Updated Jul 1, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Leila Ouahrani (2020). AR-ASAG-Dataset [Dataset]. http://doi.org/10.17632/dj95jh332j.1
    Explore at:
    Dataset updated
    Jul 1, 2020
    Authors
    Leila Ouahrani
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ARabic Dataset for Automatic Short Answer Grading Evaluation V1. ISLRN 529-005-230-448-6. Our dataset consists of reported evaluations relate to answers submitted for three different exams submitted to three classes of students. The exams were conducted under natural conditions of evaluation. Each test consists of 16 short answer questions (a total of 48 questions). For each question, a model answer is proposed. Students submitted answers to these questions.
    The number of answers obtained is different from one question to another. The dataset includes a total of 2133 pairs (Model Answer, student answer). the Dataset encompasses 5 types of questions: • "عرف ": Define? • "إشرح": Explain? • "ما النتائج المترتبة على": What consequences? • "علل": Justify? • "ما الفرق": What is the difference

    AR-ASAG Dataset is available in different versions: TXT, XML, XML-MOODLE and Database (.DB).
    The .DB format allows making the necessary exports according to specific analysis needs.
    The XML-MOODLE format is used on Moodle e-learning Platforms For each pair, two grades (Mark1 and Mark2 ) are associated with a manual Average Gold Score Both manual grades are available in the dataset. Inter-Annotators Agreement: - (Pearson Correlation: r=0.8384) - (Root Mean Square Error : RMSE=0.8381). The Dataset can be also used for essay scoring as the students's answers responses take to reach 4-5 sentences. The Dataset exist in TXT, XML, XML-MOODLE Versions The name of the file is representative of its content. We use the term "Mark" to specify "Grade" For privacy reasons, no student identifiers are used in this Dataset.

  11. Sample RAG Knowledge Item Dataset

    • kaggle.com
    Updated Aug 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    David Hundley (2024). Sample RAG Knowledge Item Dataset [Dataset]. https://www.kaggle.com/datasets/dkhundley/sample-rag-knowledge-item-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 18, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    David Hundley
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is a small dataset that learners can use for testing out their retrieval augmented generation (RAG) knowledge. New RAG students will learn that their RAG processes can be optimized in many ways, including how the documents are chunked, how chunks are retrieved, and more. This dataset was designed to allow students to experiment with these different strategies.

    This smaller dataset was generated from a larger dataset that I created, which can be found at this link:

    https://www.kaggle.com/datasets/dkhundley/synthetic-it-related-knowledge-items

    This larger dataset represents a set of 100 articles that you might find in a typical Fortune 500’s IT helpdesk. Students are advised to use the larger dataset for a full RAG experimentation, but this smaller dataset provided here contains a focused set of material to test with amongst each of your experiments.

    Both this dataset and the other larger dataset were generated using this Kaggle notebook:

    https://www.kaggle.com/code/dkhundley/generate-synthetic-ki-dataset

  12. m

    SPHERE: Students' performance dataset of conceptual understanding,...

    • data.mendeley.com
    Updated Jan 15, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Purwoko Haryadi Santoso (2025). SPHERE: Students' performance dataset of conceptual understanding, scientific ability, and learning attitude in physics education research (PER) [Dataset]. http://doi.org/10.17632/88d7m2fv7p.2
    Explore at:
    Dataset updated
    Jan 15, 2025
    Authors
    Purwoko Haryadi Santoso
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SPHERE is students' performance in physics education research dataset. It is presented as a multi-domain learning dataset of students’ performance on physics that has been collected through several research-based assessments (RBAs) established by the physics education research (PER) community. A total of 497 eleventh-grade students were involved from three large and a small public high school located in a suburban district of a high-populated province in Indonesia. Some variables related to demographics, accessibility to literature resources, and students’ physics identity are also investigated. Some RBAs utilized in this data were selected based on concepts learned by the students in the Indonesian physics curriculum. We commenced the survey of students’ understanding on Newtonian mechanics at the end of the first semester using Force Concept Inventory (FCI) and Force and Motion Conceptual Evaluation (FMCE). In the second semester, we assessed the students’ scientific abilities and learning attitude through Scientific Abilities Assessment Rubrics (SAAR) and the Colorado Learning Attitudes about Science Survey (CLASS) respectively. The conceptual assessments were continued at the second semester measured through Rotational and Rolling Motion Conceptual Survey (RRMCS), Fluid Mechanics Concept Inventory (FMCI), Mechanical Waves Conceptual Survey (MWCS), Thermal Concept Evaluation (TCE), and Survey of Thermodynamic Processes and First and Second Laws (STPFaSL). We expect SPHERE could be a valuable dataset for supporting the advancement of the PER field particularly in quantitative studies. For example, there is a need to help advance research on using machine learning and data mining techniques in PER that might face challenges due to the unavailable dataset for the specific purpose of PER studies. SPHERE can be reused as a students’ performance dataset on physics specifically dedicated for PER scholars which might be willing to implement machine learning techniques in physics education.

  13. f

    Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene...

    • frontiersin.figshare.com
    docx
    Updated Mar 22, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s004
    Explore at:
    docxAvailable download formats
    Dataset updated
    Mar 22, 2024
    Dataset provided by
    Frontiers
    Authors
    Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.

  14. Data from: Dataset: Mental health and individual differences in the short-...

    • figshare.com
    Updated Dec 9, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Maria Paola Jiménez-Villamizar; Laura Comendador; Josep Maria Losilla; Juan P. Sanabria-Mazo; Corel Mateo-Canedo; Anna Muro; Antoni Sanz (2022). Dataset: Mental health and individual differences in the short- and long-term adaptation processes of university students during the COVID-19 pandemic [Dataset]. http://doi.org/10.6084/m9.figshare.21701228.v1
    Explore at:
    Dataset updated
    Dec 9, 2022
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Maria Paola Jiménez-Villamizar; Laura Comendador; Josep Maria Losilla; Juan P. Sanabria-Mazo; Corel Mateo-Canedo; Anna Muro; Antoni Sanz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset derived from the sistematic review describes at https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=330361

  15. f

    F1-score results [40].

    • plos.figshare.com
    xls
    Updated Sep 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). F1-score results [40]. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.

  16. m

    Data from: Student grade prediction dataset

    • data.mendeley.com
    Updated Jun 16, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nonso Nnamoko (2022). Student grade prediction dataset [Dataset]. http://doi.org/10.17632/wf8568hxb7.1
    Explore at:
    Dataset updated
    Jun 16, 2022
    Authors
    Nonso Nnamoko
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset provides a collection of 160 instances belonging to two classes (pass' = 136 andfail' = 24). The data is an anonymised, statistically sound and reliable representation of the original data collected from students studying computer science modules at a UK University. Each instance is made up of 19 features plus the class label. Eight of the features represent students' online behaviour including bio information retrieved from Virtual Learning Environment. Eleven of the features represent students' neighbourhood influence retrieved from Office for Students database. The data has been compiled and made available in de-facto/de-jure standard open formats (CSV and JSON).

    This data was collected and used in a research study undertaken by academics and researchers at Computer Science Department, Edge Hill University, United Kingdom. To encourage reproducibility of the experiments and results reported, the data is provided in the exact training-validation-testing splits used in the experiments.

  17. d

    School Attendance by District, 2020-2021

    • catalog.data.gov
    • data.ct.gov
    • +2more
    Updated Jun 28, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.ct.gov (2025). School Attendance by District, 2020-2021 [Dataset]. https://catalog.data.gov/dataset/school-attendance-by-district-2020-2021
    Explore at:
    Dataset updated
    Jun 28, 2025
    Dataset provided by
    data.ct.gov
    Description

    This dataset includes the attendance rate for public school students PK-12 by district during the 2020-2021 school year. Attendance rates are provided for each district for the overall student population and for the high needs student population. Students who are considered high needs include students who are English language learners, who receive special education, or who qualify for free and reduced lunch. When no attendance data is displayed in a cell, data have been suppressed to safeguard student confidentiality, or to ensure that statistics based on a very small sample size are not interpreted as equally representative as those based on a sufficiently larger sample size. For more information on CSDE data suppression policies, please visit http://edsight.ct.gov/relatedreports/BDCRE%20Data%20Suppression%20Rules.pdf.

  18. m

    SDFVD: Small-scale Deepfake Forgery Video Dataset

    • data.mendeley.com
    Updated Apr 23, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Shilpa Kaman (2024). SDFVD: Small-scale Deepfake Forgery Video Dataset [Dataset]. http://doi.org/10.17632/bcmkfgct2s.1
    Explore at:
    Dataset updated
    Apr 23, 2024
    Authors
    Shilpa Kaman
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Small-scale Deepfake Forgery Video Dataset (SDFVD) is a custom dataset consisting of real and deepfake videos with diverse contexts designed to study and benchmark deepfake detection algorithms. The dataset comprising of a total of 106 videos, with 53 original and 53 deepfake videos. Equal number of real and deepfake videos, ensures balance for machine learning model training and evaluation. The original videos were collected from Pexels: a well- known provider of stock photography and stock footage(video). These videos include a variety of backgrounds, and the subjects represent different genders and ages, reflecting a diverse range of scenarios. The input videos have been pre-processed by cropping them to a length of approximately 4 to 5 seconds and resizing them to 720p resolution, ensuring a consistent and uniform format across the dataset. Deepfake videos were generated using Remaker AI employing face-swapping techniques. Remaker AI is an AI-powered platform that can generate images, swap faces in photos and videos, and edit content. The source face photos for these swaps were taken from Freepik: is an image bank website provides contents such as photographs, illustrations and vector images. SDFVD was created due to the lack of availability of any such comparable small-scale deepfake video datasets. Key benefits of such datasets are: • In educational settings or smaller research labs, smaller datasets can be particularly useful as they require fewer resources, allowing students and researchers to conduct experiments with limited budgets and computational resources. • Researchers can use small-scale datasets to quickly prototype new ideas, test concepts, and refine algorithms before scaling up to larger datasets. Overall, SDFVD offers a compact but diverse collection of real and deepfake videos, suitable for a variety of applications, including research, security, and education. It serves as a valuable resource for exploring the rapidly evolving field of deepfake technology and its impact on society.

  19. p

    Trends in Total Students (2003-2023): Small Middle School

    • publicschoolreview.com
    Updated Oct 26, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Public School Review (2025). Trends in Total Students (2003-2023): Small Middle School [Dataset]. https://www.publicschoolreview.com/small-middle-school-profile
    Explore at:
    Dataset updated
    Oct 26, 2025
    Dataset authored and provided by
    Public School Review
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset tracks annual total students amount from 2003 to 2023 for Small Middle School

  20. Sample data having columns with the information like ID, Tweet, and Label.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Dec 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Muhammad Tayyab Zamir; Fida Ullah; Rasikh Tariq; Waqas Haider Bangyal; Muhammad Arif; Alexander Gelbukh (2024). Sample data having columns with the information like ID, Tweet, and Label. [Dataset]. http://doi.org/10.1371/journal.pone.0315407.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Dec 19, 2024
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Muhammad Tayyab Zamir; Fida Ullah; Rasikh Tariq; Waqas Haider Bangyal; Muhammad Arif; Alexander Gelbukh
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Ghulam Muhammad Nabeel (2025). Student Performance Dataset [Dataset]. https://www.kaggle.com/datasets/nabeelqureshitiii/student-performance-dataset
Organization logo

Student Performance Dataset

A generic data for ML Beginners

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 27, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ghulam Muhammad Nabeel
Description

📊 Student Performance Dataset (Synthetic, Realistic)

Overview

This dataset contains 1000000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.

Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is small, clean, and structured to be beginner-friendly.

🔑 Columns Description

  • student_id → Unique identifier for each student.
  • weekly_self_study_hours → Average weekly self-study hours (0–40). Generated using a normal distribution centered around 15 hours.
  • attendance_percentage → Attendance percentage (50–100). Simulated with a normal distribution around 85%.
  • class_participation → Score between 0–10 indicating how actively the student participates in class. Generated from a normal distribution centered around 6.
  • total_score → Final performance score (0–100). Calculated as a function of study hours + random noise, then clipped between 0–100. Stronger correlation with study hours.
  • grade → Categorical label (A, B, C, D, F) derived from total_score.

📐 Data Generation Logic

  1. Weekly Study Hours: Modeled using a normal distribution (mean ≈ 15, std ≈ 7), capped between 0 and 40 hours.
  2. Scores: More study hours → higher score. Formula:

Random noise simulates differences in learning ability, motivation, etc.

  1. Attendance & Participation: Independent but realistic variations added.
  2. Grades: Assigned from scores using thresholds:
  • A: ≥ 85
  • B: ≥ 70
  • C: ≥ 55
  • D: ≥ 40
  • F: < 40

🎯 How to Use This Dataset

Regression Tasks

  • Predict total_score from weekly_self_study_hours.
  • Train and evaluate Linear Regression models.
  • Extend to multiple regression using attendance_percentage and class_participation.

Classification Tasks

  • Predict grade (A–F) using study hours, attendance, and participation.

Model Evaluation Practice

  • Apply train-test split and cross-validation.
  • Evaluate with MAE, RMSE, R².
  • Compare simple vs. multiple regression.

✅ This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).

Search
Clear search
Close search
Google apps
Main menu