78 datasets found

Student Performance Dataset
kaggle.com
Updated Aug 27, 2025
Cite
Ghulam Muhammad Nabeel (2025). Student Performance Dataset [Dataset]. https://www.kaggle.com/datasets/nabeelqureshitiii/student-performance-dataset
Dataset updated
Aug 27, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ghulam Muhammad Nabeel
Description
📊 Student Performance Dataset (Synthetic, Realistic)

Overview

This dataset contains 1,000,000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.

Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is small, clean, and structured to be beginner-friendly.

🔑 Columns Description

student_id → Unique identifier for each student.

weekly_self_study_hours → Average weekly self-study hours (0–40). Generated using a normal distribution centered around 15 hours.

attendance_percentage → Attendance percentage (50–100). Simulated with a normal distribution around 85%.

class_participation → Score between 0–10 indicating how actively the student participates in class. Generated from a normal distribution centered around 6.

total_score → Final performance score (0–100). Calculated as a function of study hours plus random noise, then clipped to 0–100. It correlates more strongly with study hours than with the other features.

grade → Categorical label (A, B, C, D, F) derived from total_score.

📐 Data Generation Logic

Weekly Study Hours: Modeled using a normal distribution (mean ≈ 15, std ≈ 7), capped between 0 and 40 hours.

Scores: More study hours → higher score. Formula:

Random noise simulates differences in learning ability, motivation, etc.

Attendance & Participation: Independent but realistic variations added.

Grades: Assigned from scores using thresholds:

A: ≥ 85

B: ≥ 70

C: ≥ 55

D: ≥ 40

F: < 40
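
The thresholds above can be written as a small function. This is a minimal sketch, not the author's generation code; the function name `grade_from_score` is hypothetical, and only the cut-offs come from the description.

```python
def grade_from_score(total_score: float) -> str:
    """Map a 0-100 score to a letter grade using the listed thresholds."""
    if total_score >= 85:
        return "A"
    if total_score >= 70:
        return "B"
    if total_score >= 55:
        return "C"
    if total_score >= 40:
        return "D"
    return "F"  # anything below 40
```

Checking from the highest threshold downward means each branch only needs a lower bound, which mirrors how the list is written.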

🎯 How to Use This Dataset

Regression Tasks

Predict total_score from weekly_self_study_hours.

Train and evaluate Linear Regression models.

Extend to multiple regression using attendance_percentage and class_participation.

Classification Tasks

Predict grade (A–F) using study hours, attendance, and participation.

Model Evaluation Practice

Apply train-test split and cross-validation.

Evaluate with MAE, RMSE, R².

Compare simple vs. multiple regression.
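
As a rough illustration of the evaluation workflow listed above, the sketch below fits a one-feature linear regression and reports MAE, RMSE, and R² on a held-out split using only the Python standard library. All function names are hypothetical, and the toy data uses made-up coefficients as a stand-in, since the dataset's actual score formula is not shown here.

```python
import math
import random


def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return mae, rmse, 1 - ss_res / ss_tot


def train_test_split(xs, ys, test_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) - int(len(idx) * test_fraction)
    train, test = idx[:cut], idx[cut:]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in test], [ys[i] for i in test])


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in data: slope, intercept, and noise level are assumptions.
    hours = [rng.uniform(0, 40) for _ in range(200)]
    scores = [min(100.0, max(0.0, 2.0 * h + 30.0 + rng.gauss(0, 5)))
              for h in hours]
    x_tr, y_tr, x_te, y_te = train_test_split(hours, scores)
    slope, intercept = fit_simple_linear(x_tr, y_tr)
    preds = [slope * x + intercept for x in x_te]
    mae, rmse, r2 = regression_metrics(y_te, preds)
    print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

With the real file, `hours` and `scores` would be read from the weekly_self_study_hours and total_score columns instead of being simulated.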

✅ This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).
Student Performance Data Set
kaggle.com
zip
Updated Mar 27, 2020
Cite
Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
Available download formats: zip (12353 bytes)
Dataset updated
Mar 27, 2020
Authors
Data-Science Sean
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
If this data set is useful, an upvote is appreciated. This data addresses student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st- and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
UTAUT2 Dataset on Gen Z Students Use of Short Learning Videos
ieee-dataport.org
Updated May 7, 2025
Cite
Khaled Halimi (2025). UTAUT2 Dataset on Gen Z Students Use of Short Learning Videos [Dataset]. https://ieee-dataport.org/documents/utaut2-dataset-gen-z-students-use-short-learning-videos
Dataset updated
May 7, 2025
Authors
Khaled Halimi
Description
This dataset contains raw survey data collected from 207 Generation Z students at the University of Guelma
Attendance sheet Data set for University
kaggle.com
zip
Updated May 18, 2023
Cite
Ahmed Ali (2023). Attendance sheet Data set for University [Dataset]. https://www.kaggle.com/datasets/ahmedaliraja/attendance-sheet-data-set-for-university
Available download formats: zip (608 bytes)
Dataset updated
May 18, 2023
Authors
Ahmed Ali
Description
Context: The University Attendance Sheet Dataset is a comprehensive collection of attendance records from various university courses. This dataset is valuable for analyzing student attendance patterns, studying the impact of attendance on academic performance, and exploring factors influencing student engagement. It provides a rich resource for researchers, educators, and students interested in understanding attendance dynamics within a university setting.

Content: The dataset includes the following information:

Student ID: A unique identifier for each student.
Course ID: A unique identifier for each course.
Date: The date of the attendance record.
Attendance Status: Indicates whether the student was present, absent, or had an excused absence on a particular date.

The dataset contains records from multiple academic semesters, covering a wide range of courses across different disciplines. By examining this dataset, researchers can investigate attendance trends across different courses, identify patterns related to student performance, and explore correlations between attendance and other academic variables.

Acknowledgements: We would like to express our gratitude to the university administration, faculty members, and students who contributed to the collection and organization of this dataset. Their cooperation and support have made this dataset possible, enabling valuable insights into student attendance dynamics.

Inspiration: The inspiration behind creating this dataset stems from the recognition of the significant role attendance plays in a student's academic journey. By making this dataset available on Kaggle, we hope to facilitate research and analysis on attendance patterns, identify interventions to improve student engagement, and provide educators with valuable insights to enhance their teaching strategies. We also encourage collaboration and exploration of the dataset to uncover new findings and generate knowledge that can benefit the education community as a whole.

By leveraging the University Attendance Sheet Dataset, we aspire to contribute to the ongoing efforts to improve student success and foster an environment that promotes active participation and learning within higher education institutions.
Detailed characterization of the dataset.
figshare.com
xls
Updated Sep 26, 2024
Cite
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Detailed characterization of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t006
Dataset updated
Sep 26, 2024
Dataset provided by
PLOS ONE
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
School Attendance by Student Group and District, 2022-2023
catalog.data.gov
data.ct.gov
Updated Sep 15, 2023
Cite
data.ct.gov (2023). School Attendance by Student Group and District, 2022-2023 [Dataset]. https://catalog.data.gov/dataset/school-attendance-by-student-group-and-district-2022-2023
Dataset updated
Sep 15, 2023
Dataset provided by
data.ct.gov
Description
This dataset includes the attendance rate for public school students PK-12 by student group and by district during the 2022-2023 school year. Student groups include:
Students experiencing homelessness
Students with disabilities
Students who qualify for free/reduced lunch
English learners
All high needs students
Non-high needs students
Students by race/ethnicity (Hispanic/Latino of any race, Black or African American, White, All other races)
Attendance rates are provided for each student group by district and for the state. Students who are considered high needs include students who are English language learners, who receive special education, or who qualify for free and reduced lunch. When no attendance data is displayed in a cell, data have been suppressed to safeguard student confidentiality, or to ensure that statistics based on a very small sample size are not interpreted as equally representative as those based on a sufficiently larger sample size. For more information on CSDE data suppression policies, please visit http://edsight.ct.gov/relatedreports/BDCRE%20Data%20Suppression%20Rules.pdf.
Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Cite
Julie R. Campos Arias (repository creator) (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Julie R. Campos Arias (repository creator)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the official Amazon website in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
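
Because the rows are ordered by label, a fixed-seed shuffle before any train/test split avoids folds that contain a single class. The sketch below is one way to do this with the standard library; the helper names are hypothetical, and it assumes the CSV has a header row naming the reviews and labels columns.

```python
import csv
import random


def load_reviews(path):
    """Read the CSV into a list of dicts (assumes a header row is present)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def shuffled(rows, seed=42):
    """Return a shuffled copy of rows; the input list is left untouched."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows


if __name__ == "__main__":
    # In-memory stand-in for the file: negatives first, then positives.
    demo = [{"reviews": f"text {i}", "labels": "0"} for i in range(5)]
    demo += [{"reviews": f"text {i}", "labels": "1"} for i in range(5, 10)]
    print([row["labels"] for row in shuffled(demo)])
```

With the real file, `shuffled(load_reviews("data_rt.csv"))` would take the place of the in-memory demo; using a fixed seed keeps the split reproducible across runs.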
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train a machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review - raw), and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Students Data Analysis
kaggle.com
zip
Updated Jul 20, 2022
Cite
MOMONO (2022). Students Data Analysis [Dataset]. https://www.kaggle.com/datasets/erqizhou/students-data-analysis
Available download formats: zip (2174 bytes)
Dataset updated
Jul 20, 2022
Authors
MOMONO
Description
A small excerpt from a real dataset, with a few minor changes to protect students' private information. Permission has been given.

Goals

You are going to help teachers using only the data:
1. Prediction: Identify what makes a strong student who can successfully apply to graduate school, whether abroad or not.
2. Application: Give job-search advice to those who fail to get into graduate school.

Tips

Educational data may have subtle structures; hierarchies and heterogeneity are probably involved. Simple regressions are unlikely to be enough. Also keep an eye on collinearity among some indicators, which were collected by teachers who have long since forgotten their statistics.

Not all students are free to choose whether to apply to graduate school; some were born with privileges.

Some students have been trying (or planning to try) to get into graduate school for years; you are responsible for giving advice that is accurate for their circumstances.

About the Data

Some of the original structure has been deleted or censored. What remains is basic data such as:
- ID
- class: categorical; students were initially divided into 2 classes, and teachers suspect that students in different classes may perform significantly differently
- gender
- race: categorical and censored
- GPA: real number (float)

Some teachers assume that scores in mathematics courses perfectly represent a student's likelihood of success:
- Algebra: real number, Advanced Algebra
- ......

Some assume that students' backgrounds significantly affect their choices and likelihood of success; these variables are all censored:
- from1: students' home locations
- from2: a probably poor indicator of preference for mathematics
- from3: how students applied to this university (undergraduate)
- from4: a probably poor indicator of family background; 0 indicates more wealth, 4 more poverty

The final indicator y:
- 0: failed to get into graduate school; may apply again or look for a job in the future
- 1: success, domestic
- 2: success, abroad
books-tabular-dataset
huggingface.co
Updated Sep 27, 2025
Cite
Chenrui (2025). books-tabular-dataset [Dataset]. https://huggingface.co/datasets/EricCRX/books-tabular-dataset
Dataset updated
Sep 27, 2025
Authors
Chenrui
Description
📄 Model Card: Books Tabular Dataset

1. Purpose

This dataset was created for educational purposes in the context of Homework 1 (Dealing with Data). The goal is to provide a small but structured tabular dataset that allows students to practice working with real-world features, preprocessing, augmentation, and uploading to Hugging Face. The dataset supports tasks such as classification, exploratory data analysis (EDA), and simple modeling.

2. Composition… See the full description on the dataset page: https://huggingface.co/datasets/EricCRX/books-tabular-dataset.
AR-ASAG-Dataset
data.mendeley.com
kaggle.com
Updated Jul 1, 2020
Cite
Leila Ouahrani (2020). AR-ASAG-Dataset [Dataset]. http://doi.org/10.17632/dj95jh332j.1
Unique identifier
https://doi.org/10.17632/dj95jh332j.1
Dataset updated
Jul 1, 2020
Authors
Leila Ouahrani
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The ARabic Dataset for Automatic Short Answer Grading Evaluation V1. ISLRN 529-005-230-448-6. The dataset consists of reported evaluations of answers submitted for three different exams given to three classes of students. The exams were conducted under natural evaluation conditions. Each test consists of 16 short answer questions (a total of 48 questions). For each question, a model answer is proposed. Students submitted answers to these questions.
The number of answers obtained differs from one question to another. The dataset includes a total of 2133 pairs (model answer, student answer). The dataset encompasses 5 types of questions:
• "عرف": Define?
• "إشرح": Explain?
• "ما النتائج المترتبة على": What consequences?
• "علل": Justify?
• "ما الفرق": What is the difference?

The AR-ASAG Dataset is available in different versions: TXT, XML, XML-MOODLE and Database (.DB).
The .DB format allows making the necessary exports according to specific analysis needs.
The XML-MOODLE format is used on Moodle e-learning platforms. For each pair, two grades (Mark1 and Mark2) are associated with a manual average gold score. Both manual grades are available in the dataset. Inter-annotator agreement: Pearson correlation r = 0.8384; root mean square error RMSE = 0.8381. The dataset can also be used for essay scoring, as students' answers can reach 4-5 sentences. The name of each file is representative of its content. We use the term "Mark" to mean "Grade". For privacy reasons, no student identifiers are used in this dataset.
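
The two agreement figures quoted above are standard statistics. As a minimal sketch, assuming they were computed between the two manual grade columns (Mark1 and Mark2), they could be reproduced as follows; the function names are hypothetical and the example marks are invented purely to show the calls.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rmse(a, b):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    # Invented marks from two graders, for illustration only.
    mark1 = [4.0, 3.5, 2.0, 5.0, 1.0]
    mark2 = [4.5, 3.0, 2.5, 5.0, 0.5]
    print(f"r={pearson_r(mark1, mark2):.4f} RMSE={rmse(mark1, mark2):.4f}")
```

A high r alongside a non-trivial RMSE would indicate that the graders rank answers similarly while still differing in the absolute marks they assign.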
Sample RAG Knowledge Item Dataset
kaggle.com
Updated Aug 18, 2024
Cite
David Hundley (2024). Sample RAG Knowledge Item Dataset [Dataset]. https://www.kaggle.com/datasets/dkhundley/sample-rag-knowledge-item-dataset
Dataset updated
Aug 18, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
David Hundley
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This is a small dataset that learners can use for testing out their retrieval augmented generation (RAG) knowledge. New RAG students will learn that their RAG processes can be optimized in many ways, including how the documents are chunked, how chunks are retrieved, and more. This dataset was designed to allow students to experiment with these different strategies.

This smaller dataset was generated from a larger dataset that I created, which can be found at this link:

https://www.kaggle.com/datasets/dkhundley/synthetic-it-related-knowledge-items

This larger dataset represents a set of 100 articles that you might find in a typical Fortune 500’s IT helpdesk. Students are advised to use the larger dataset for a full RAG experimentation, but this smaller dataset provided here contains a focused set of material to test with amongst each of your experiments.

Both this dataset and the other larger dataset were generated using this Kaggle notebook:

https://www.kaggle.com/code/dkhundley/generate-synthetic-ki-dataset
SPHERE: Students' performance dataset of conceptual understanding,...
data.mendeley.com
Updated Jan 15, 2025
Cite
Purwoko Haryadi Santoso (2025). SPHERE: Students' performance dataset of conceptual understanding, scientific ability, and learning attitude in physics education research (PER) [Dataset]. http://doi.org/10.17632/88d7m2fv7p.2
Unique identifier
https://doi.org/10.17632/88d7m2fv7p.2
Dataset updated
Jan 15, 2025
Authors
Purwoko Haryadi Santoso
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SPHERE is a dataset of students' performance in physics education research. It is presented as a multi-domain learning dataset of students' performance in physics, collected through several research-based assessments (RBAs) established by the physics education research (PER) community. A total of 497 eleventh-grade students were involved, from three large public high schools and one small public high school located in a suburban district of a highly populated province in Indonesia. Some variables related to demographics, accessibility to literature resources, and students' physics identity are also investigated. The RBAs used in this dataset were selected based on concepts learned by the students in the Indonesian physics curriculum. We began by surveying students' understanding of Newtonian mechanics at the end of the first semester using the Force Concept Inventory (FCI) and the Force and Motion Conceptual Evaluation (FMCE). In the second semester, we assessed the students' scientific abilities and learning attitude through the Scientific Abilities Assessment Rubrics (SAAR) and the Colorado Learning Attitudes about Science Survey (CLASS), respectively. The conceptual assessments continued in the second semester with the Rotational and Rolling Motion Conceptual Survey (RRMCS), Fluid Mechanics Concept Inventory (FMCI), Mechanical Waves Conceptual Survey (MWCS), Thermal Concept Evaluation (TCE), and Survey of Thermodynamic Processes and First and Second Laws (STPFaSL). We expect SPHERE to be a valuable dataset for supporting the advancement of the PER field, particularly in quantitative studies. For example, there is a need to help advance research on machine learning and data mining techniques in PER, which may face challenges because no dataset exists for the specific purposes of PER studies.
SPHERE can be reused as a dataset of students' performance in physics, dedicated specifically to PER scholars who wish to apply machine learning techniques in physics education.
Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene...
frontiersin.figshare.com
docx
Updated Mar 22, 2024
Cite
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder (2024). Data_Sheet_4_“R” U ready?: a case study using R to analyze changes in gene expression during evolution.docx [Dataset]. http://doi.org/10.3389/feduc.2024.1379910.s004
Available download formats: docx
Unique identifier
https://doi.org/10.3389/feduc.2024.1379910.s004
Dataset updated
Mar 22, 2024
Dataset provided by
Frontiers
Authors
Amy E. Pomeroy; Andrea Bixler; Stefanie H. Chen; Jennifer E. Kerr; Todd D. Levine; Elizabeth F. Ryder
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
Data from: Dataset: Mental health and individual differences in the short-...
figshare.com
Updated Dec 9, 2022
Cite
Maria Paola Jiménez-Villamizar; Laura Comendador; Josep Maria Losilla; Juan P. Sanabria-Mazo; Corel Mateo-Canedo; Anna Muro; Antoni Sanz (2022). Dataset: Mental health and individual differences in the short- and long-term adaptation processes of university students during the COVID-19 pandemic [Dataset]. http://doi.org/10.6084/m9.figshare.21701228.v1
Unique identifier
https://doi.org/10.6084/m9.figshare.21701228.v1
Dataset updated
Dec 9, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Maria Paola Jiménez-Villamizar; Laura Comendador; Josep Maria Losilla; Juan P. Sanabria-Mazo; Corel Mateo-Canedo; Anna Muro; Antoni Sanz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset derived from the systematic review described at https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=330361
F1-score results [40].
plos.figshare.com
xls
Updated Sep 26, 2024
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). F1-score results [40]. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310707.t002
Dataset updated
Sep 26, 2024
Dataset provided by
PLOS ONE
Authors
Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
Data from: Student grade prediction dataset
data.mendeley.com
Updated Jun 16, 2022
Nonso Nnamoko (2022). Student grade prediction dataset [Dataset]. http://doi.org/10.17632/wf8568hxb7.1
Explore at:
Unique identifier
https://doi.org/10.17632/wf8568hxb7.1
Dataset updated
Jun 16, 2022
Authors
Nonso Nnamoko
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset provides a collection of 160 instances belonging to two classes ('pass' = 136 and 'fail' = 24). The data is an anonymised, statistically sound and reliable representation of the original data collected from students studying computer science modules at a UK University. Each instance is made up of 19 features plus the class label. Eight of the features represent students' online behaviour, including bio information retrieved from the Virtual Learning Environment. Eleven of the features represent students' neighbourhood influence, retrieved from the Office for Students database. The data has been compiled and made available in de-facto/de-jure standard open formats (CSV and JSON).

This data was collected and used in a research study undertaken by academics and researchers at the Computer Science Department, Edge Hill University, United Kingdom. To encourage reproducibility of the experiments and results reported, the data is provided in the exact training-validation-testing splits used in the experiments.
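The class counts in the description (136 'pass', 24 'fail') imply a noticeable imbalance. As a rough sketch of what that means for evaluation, the snippet below computes the accuracy of a trivial always-predict-the-majority classifier and one common inverse-frequency weighting scheme. The function names are illustrative and not part of the dataset or the original study.

```python
# Class counts as stated in the dataset description.
CLASS_COUNTS = {"pass": 136, "fail": 24}


def majority_baseline_accuracy(counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    print(majority_baseline_accuracy(CLASS_COUNTS))
    print(balanced_class_weights(CLASS_COUNTS))
```

Any model trained on this data should be compared against that majority baseline rather than against 50%, and metrics such as per-class recall are likely more informative than plain accuracy.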
School Attendance by District, 2020-2021
catalog.data.gov
data.ct.gov
Updated Jun 28, 2025
data.ct.gov (2025). School Attendance by District, 2020-2021 [Dataset]. https://catalog.data.gov/dataset/school-attendance-by-district-2020-2021
Explore at:
Dataset updated
Jun 28, 2025
Dataset provided by
data.ct.gov
Description
This dataset includes the attendance rate for public school students PK-12 by district during the 2020-2021 school year. Attendance rates are provided for each district for the overall student population and for the high needs student population. Students who are considered high needs include students who are English language learners, who receive special education, or who qualify for free and reduced lunch. When no attendance data is displayed in a cell, data have been suppressed to safeguard student confidentiality, or to ensure that statistics based on a very small sample size are not interpreted as equally representative as those based on a sufficiently larger sample size. For more information on CSDE data suppression policies, please visit http://edsight.ct.gov/relatedreports/BDCRE%20Data%20Suppression%20Rules.pdf.
SDFVD: Small-scale Deepfake Forgery Video Dataset
data.mendeley.com
Updated Apr 23, 2024
Shilpa Kaman (2024). SDFVD: Small-scale Deepfake Forgery Video Dataset [Dataset]. http://doi.org/10.17632/bcmkfgct2s.1
Explore at:
Unique identifier
https://doi.org/10.17632/bcmkfgct2s.1
Dataset updated
Apr 23, 2024
Authors
Shilpa Kaman
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Small-scale Deepfake Forgery Video Dataset (SDFVD) is a custom dataset consisting of real and deepfake videos with diverse contexts, designed to study and benchmark deepfake detection algorithms. The dataset comprises a total of 106 videos: 53 original and 53 deepfake. The equal number of real and deepfake videos ensures balance for machine learning model training and evaluation. The original videos were collected from Pexels, a well-known provider of stock photography and stock footage (video). These videos include a variety of backgrounds, and the subjects represent different genders and ages, reflecting a diverse range of scenarios. The input videos have been pre-processed by cropping them to a length of approximately 4 to 5 seconds and resizing them to 720p resolution, ensuring a consistent and uniform format across the dataset. Deepfake videos were generated using Remaker AI employing face-swapping techniques. Remaker AI is an AI-powered platform that can generate images, swap faces in photos and videos, and edit content. The source face photos for these swaps were taken from Freepik, an image bank website that provides content such as photographs, illustrations and vector images. SDFVD was created because no comparable small-scale deepfake video datasets were available. Key benefits of such datasets are:
• In educational settings or smaller research labs, smaller datasets can be particularly useful as they require fewer resources, allowing students and researchers to conduct experiments with limited budgets and computational resources.
• Researchers can use small-scale datasets to quickly prototype new ideas, test concepts, and refine algorithms before scaling up to larger datasets.
Overall, SDFVD offers a compact but diverse collection of real and deepfake videos, suitable for a variety of applications, including research, security, and education. It serves as a valuable resource for exploring the rapidly evolving field of deepfake technology and its impact on society.
Trends in Total Students (2003-2023): Small Middle School
publicschoolreview.com
Updated Oct 26, 2025
Public School Review (2025). Trends in Total Students (2003-2023): Small Middle School [Dataset]. https://www.publicschoolreview.com/small-middle-school-profile
Explore at:
Dataset updated
Oct 26, 2025
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset tracks the annual total number of students from 2003 to 2023 for Small Middle School.
Sample data having columns with the information like ID, Tweet, and Label.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Dec 19, 2024
Muhammad Tayyab Zamir; Fida Ullah; Rasikh Tariq; Waqas Haider Bangyal; Muhammad Arif; Alexander Gelbukh (2024). Sample data having columns with the information like ID, Tweet, and Label. [Dataset]. http://doi.org/10.1371/journal.pone.0315407.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0315407.t002
Dataset updated
Dec 19, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Muhammad Tayyab Zamir; Fida Ullah; Rasikh Tariq; Waqas Haider Bangyal; Muhammad Arif; Alexander Gelbukh
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Adapted from: [https://www.kaggle.com/datasets/csmalarkodi/covid-fake-news-dataset].

Ghulam Muhammad Nabeel (2025). Student Performance Dataset [Dataset]. https://www.kaggle.com/datasets/nabeelqureshitiii/student-performance-dataset

Student Performance Dataset

A generic data for ML Beginners

Explore at:

Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Aug 27, 2025

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Ghulam Muhammad Nabeel

Description

📊 Student Performance Dataset (Synthetic, Realistic)

Overview

This dataset contains 1,000,000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.

Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset has only a few columns and is clean and structured to be beginner-friendly.

🔑 Columns Description

student_id → Unique identifier for each student.
weekly_self_study_hours → Average weekly self-study hours (0–40). Generated using a normal distribution centered around 15 hours.
attendance_percentage → Attendance percentage (50–100). Simulated with a normal distribution around 85%.
class_participation → Score between 0–10 indicating how actively the student participates in class. Generated from a normal distribution centered around 6.
total_score → Final performance score (0–100). Calculated as a function of study hours + random noise, then clipped between 0–100. Of all the features, it correlates most strongly with study hours.
grade → Categorical label (A, B, C, D, F) derived from total_score.

📐 Data Generation Logic

Weekly Study Hours: Modeled using a normal distribution (mean ≈ 15, std ≈ 7), capped between 0 and 40 hours.
Scores: More study hours → higher score. Formula:

Random noise simulates differences in learning ability, motivation, etc.

Attendance & Participation: Independent but realistic variations added.
Grades: Assigned from scores using thresholds:

A: ≥ 85
B: ≥ 70
C: ≥ 55
D: ≥ 40
F: < 40
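The thresholds above can be expressed as a small lookup. This is a minimal sketch of the stated rule, not the author's generation script; the function name `assign_grade` is an illustrative choice.

```python
def assign_grade(total_score: float) -> str:
    """Map a 0-100 score to a letter grade using the thresholds listed above."""
    # Cutoffs are checked from highest to lowest; the first match wins.
    thresholds = [(85, "A"), (70, "B"), (55, "C"), (40, "D")]
    for cutoff, letter in thresholds:
        if total_score >= cutoff:
            return letter
    return "F"  # anything below 40


if __name__ == "__main__":
    for score in (92.5, 85, 84.9, 55, 39.9):
        print(score, assign_grade(score))
```

Because grade is a deterministic function of total_score, including total_score as a feature when predicting grade would leak the label.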

🎯 How to Use This Dataset

Regression Tasks

Predict total_score from weekly_self_study_hours.
Train and evaluate Linear Regression models.
Extend to multiple regression using attendance_percentage and class_participation.

Classification Tasks

Predict grade (A–F) using study hours, attendance, and participation.

Model Evaluation Practice

Apply train-test split and cross-validation.
Evaluate with MAE, RMSE, R².
Compare simple vs. multiple regression.
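The workflow above (split, fit, evaluate) can be sketched with the standard library alone. Since the page does not show the score formula, the data here is generated by a stand-in: the slope, intercept, and noise level in `make_synthetic_rows` are placeholder assumptions, and only the study-hours distribution (mean 15, std 7, clipped to 0–40) comes from the description. With the real CSV, the rows would be read from the file instead.

```python
import math
import random


def make_synthetic_rows(n, seed=0):
    """Generate (hours, score) pairs for illustration only.

    The score coefficients and noise level are ASSUMED placeholders,
    not the dataset's actual formula.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hours = min(40.0, max(0.0, rng.gauss(15.0, 7.0)))
        score = 2.0 * hours + 40.0 + rng.gauss(0.0, 5.0)  # hypothetical formula
        rows.append((hours, min(100.0, max(0.0, score))))
    return rows


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_simple_regression(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def evaluate(rows, slope, intercept):
    """Return MAE, RMSE and R² of the fitted line on the given rows."""
    n = len(rows)
    errors = [y - (slope * x + intercept) for x, y in rows]
    mean_y = sum(y for _, y in rows) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((y - mean_y) ** 2 for _, y in rows)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }


if __name__ == "__main__":
    train, test = train_test_split(make_synthetic_rows(1000))
    slope, intercept = fit_simple_regression(train)
    print(slope, intercept, evaluate(test, slope, intercept))
```

Metrics are computed on the held-out rows only, which is the point of the split. Extending this to multiple regression with attendance_percentage and class_participation would normally be done with a linear-algebra or ML library rather than by hand.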

✅ This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).