Facebook
TwitterThis dataset contains 1000000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.
Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is small, clean, and structured to be beginner-friendly.
Random noise simulates differences in learning ability, motivation, etc.
Regression Tasks
total_score from weekly_self_study_hours. attendance_percentage and class_participation. Classification Tasks
grade (A–F) using study hours, attendance, and participation. Model Evaluation Practice
✅ This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
If this Data Set is useful, and upvote is appreciated. This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
Facebook
TwitterThis dataset contains raw survey data collected from 207 Generation Z students at the University of Guelma
Facebook
TwitterContext: The University Attendance Sheet Dataset is a comprehensive collection of attendance records from various university courses. This dataset is valuable for analyzing student attendance patterns, studying the impact of attendance on academic performance, and exploring factors influencing student engagement. It provides a rich resource for researchers, educators, and students interested in understanding attendance dynamics within a university setting.
Content: The dataset includes the following information:
Student ID: A unique identifier for each student. Course ID: A unique identifier for each course. Date: The date of the attendance record. Attendance Status: Indicates whether the student was present, absent, or had an excused absence on a particular date. The dataset contains records from multiple academic semesters, covering a wide range of courses across different disciplines. By examining this dataset, researchers can investigate attendance trends across different courses, identify patterns related to student performance, and explore correlations between attendance and other academic variables.
Acknowledgements: We would like to express our gratitude to the university administration, faculty members, and students who contributed to the collection and organization of this dataset. Their cooperation and support have made this dataset possible, enabling valuable insights into student attendance dynamics.
Inspiration: The inspiration behind creating this dataset stems from the recognition of the significant role attendance plays in a student's academic journey. By making this dataset available on Kaggle, we hope to facilitate research and analysis on attendance patterns, identify interventions to improve student engagement, and provide educators with valuable insights to enhance their teaching strategies. We also encourage collaboration and exploration of the dataset to uncover new findings and generate knowledge that can benefit the education community as a whole.
By leveraging the University Attendance Sheet Dataset, we aspire to contribute to the ongoing efforts to improve student success and foster an environment that promotes active participation and learning within higher education institutions.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
Facebook
TwitterThis dataset includes the attendance rate for public school students PK-12 by student group and by district during the 2022-2023 school year. Student groups include: Students experiencing homelessness Students with disabilities Students who qualify for free/reduced lunch English learners All high needs students Non-high needs students Students by race/ethnicity (Hispanic/Latino of any race, Black or African American, White, All other races) Attendance rates are provided for each student group by district and for the state. Students who are considered high needs include students who are English language learners, who receive special education, or who qualify for free and reduced lunch. When no attendance data is displayed in a cell, data have been suppressed to safeguard student confidentiality, or to ensure that statistics based on a very small sample size are not interpreted as equally representative as those based on a sufficiently larger sample size. For more information on CSDE data suppression policies, please visit http://edsight.ct.gov/relatedreports/BDCRE%20Data%20Suppression%20Rules.pdf.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Facebook
TwitterA little paragraph from one real dataset, with a few little changes to protect students' private information. Permissions are given.
You are going to help teachers with only the data: 1. Prediction: To tell what makes a brilliant student who can apply for a graduate school, whether abroad or not. 2. Application: To help those who fails to apply for a graduate school with advice in job searching.
Some of the original structure are deleted or censored. For those are left: Basic data like: - ID - class: categorical, initially students were divided into 2 classes, yet teachers suspect that of different classes students may performance significant differently. - gender - race: categorical and censored - GPA: real numbers, float
Some teachers assume that scores of math curriculums can represent one's likelihood perfectly: - Algebra: real numbers, Advanced Algebra - ......
Some assume that background of students can affect their choices and likelihood significantly, which are all censored as: - from1: students' home locations - from2: a probably bad indicator for preference on mathematics - from 3: how did students apply for this university (undergraduate) - from4: a probably bad indicator for family background. 0 with more wealth, 4 with more poverty
The final indicator y: - 0, one fails to apply for the graduate school, who may apply again or search jobs in the future - 1, success, inland - 2, success, abroad
Facebook
Twitter📄 Model Card: Books Tabular Dataset
1. Purpose
This dataset was created for educational purposes in the context of Homework 1 (Dealing with Data). The goal is to provide a small but structured tabular dataset that allows students to practice working with real-world features, preprocessing, augmentation, and uploading to Hugging Face. The dataset supports tasks such as classification, exploratory data analysis (EDA), and simple modeling.
2. Composition… See the full description on the dataset page: https://huggingface.co/datasets/EricCRX/books-tabular-dataset.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ARabic Dataset for Automatic Short Answer Grading Evaluation V1. ISLRN 529-005-230-448-6.
Our dataset consists of reported evaluations relate to answers submitted for three different exams submitted to three classes of students.
The exams were conducted under natural conditions of evaluation.
Each test consists of 16 short answer questions (a total of 48 questions).
For each question, a model answer is proposed.
Students submitted answers to these questions.
The number of answers obtained is different from one question to another.
The dataset includes a total of 2133 pairs (Model Answer, student answer).
the Dataset encompasses 5 types of questions:
• "عرف ": Define?
• "إشرح": Explain?
• "ما النتائج المترتبة على": What consequences?
• "علل": Justify?
• "ما الفرق": What is the difference
AR-ASAG Dataset is available in different versions: TXT, XML, XML-MOODLE and Database (.DB).
The .DB format allows making the necessary exports according to specific analysis needs.
The XML-MOODLE format is used on Moodle e-learning Platforms
For each pair, two grades (Mark1 and Mark2 ) are associated with a manual Average Gold Score
Both manual grades are available in the dataset.
Inter-Annotators Agreement: - (Pearson Correlation: r=0.8384)
- (Root Mean Square Error : RMSE=0.8381).
The Dataset can be also used for essay scoring as the students's answers responses take to reach 4-5 sentences.
The Dataset exist in TXT, XML, XML-MOODLE Versions
The name of the file is representative of its content.
We use the term "Mark" to specify "Grade"
For privacy reasons, no student identifiers are used in this Dataset.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This is a small dataset that learners can use for testing out their retrieval augmented generation (RAG) knowledge. New RAG students will learn that their RAG processes can be optimized in many ways, including how the documents are chunked, how chunks are retrieved, and more. This dataset was designed to allow students to experiment with these different strategies.
This smaller dataset was generated from a larger dataset that I created, which can be found at this link:
https://www.kaggle.com/datasets/dkhundley/synthetic-it-related-knowledge-items
This larger dataset represents a set of 100 articles that you might find in a typical Fortune 500’s IT helpdesk. Students are advised to use the larger dataset for a full RAG experimentation, but this smaller dataset provided here contains a focused set of material to test with amongst each of your experiments.
Both this dataset and the other larger dataset were generated using this Kaggle notebook:
https://www.kaggle.com/code/dkhundley/generate-synthetic-ki-dataset
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SPHERE is students' performance in physics education research dataset. It is presented as a multi-domain learning dataset of students’ performance on physics that has been collected through several research-based assessments (RBAs) established by the physics education research (PER) community. A total of 497 eleventh-grade students were involved from three large and a small public high school located in a suburban district of a high-populated province in Indonesia. Some variables related to demographics, accessibility to literature resources, and students’ physics identity are also investigated. Some RBAs utilized in this data were selected based on concepts learned by the students in the Indonesian physics curriculum. We commenced the survey of students’ understanding on Newtonian mechanics at the end of the first semester using Force Concept Inventory (FCI) and Force and Motion Conceptual Evaluation (FMCE). In the second semester, we assessed the students’ scientific abilities and learning attitude through Scientific Abilities Assessment Rubrics (SAAR) and the Colorado Learning Attitudes about Science Survey (CLASS) respectively. The conceptual assessments were continued at the second semester measured through Rotational and Rolling Motion Conceptual Survey (RRMCS), Fluid Mechanics Concept Inventory (FMCI), Mechanical Waves Conceptual Survey (MWCS), Thermal Concept Evaluation (TCE), and Survey of Thermodynamic Processes and First and Second Laws (STPFaSL). We expect SPHERE could be a valuable dataset for supporting the advancement of the PER field particularly in quantitative studies. For example, there is a need to help advance research on using machine learning and data mining techniques in PER that might face challenges due to the unavailable dataset for the specific purpose of PER studies. SPHERE can be reused as a students’ performance dataset on physics specifically dedicated for PER scholars which might be willing to implement machine learning techniques in physics education.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset derived from the sistematic review describes at https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=330361
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset provides a collection of 160 instances belonging to two classes (pass' = 136 andfail' = 24). The data is an anonymised, statistically sound and reliable representation of the original data collected from students studying computer science modules at a UK University. Each instance is made up of 19 features plus the class label. Eight of the features represent students' online behaviour including bio information retrieved from Virtual Learning Environment. Eleven of the features represent students' neighbourhood influence retrieved from Office for Students database. The data has been compiled and made available in de-facto/de-jure standard open formats (CSV and JSON).
This data was collected and used in a research study undertaken by academics and researchers at Computer Science Department, Edge Hill University, United Kingdom. To encourage reproducibility of the experiments and results reported, the data is provided in the exact training-validation-testing splits used in the experiments.
Facebook
TwitterThis dataset includes the attendance rate for public school students PK-12 by district during the 2020-2021 school year. Attendance rates are provided for each district for the overall student population and for the high needs student population. Students who are considered high needs include students who are English language learners, who receive special education, or who qualify for free and reduced lunch. When no attendance data is displayed in a cell, data have been suppressed to safeguard student confidentiality, or to ensure that statistics based on a very small sample size are not interpreted as equally representative as those based on a sufficiently larger sample size. For more information on CSDE data suppression policies, please visit http://edsight.ct.gov/relatedreports/BDCRE%20Data%20Suppression%20Rules.pdf.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Small-scale Deepfake Forgery Video Dataset (SDFVD) is a custom dataset consisting of real and deepfake videos with diverse contexts designed to study and benchmark deepfake detection algorithms. The dataset comprising of a total of 106 videos, with 53 original and 53 deepfake videos. Equal number of real and deepfake videos, ensures balance for machine learning model training and evaluation. The original videos were collected from Pexels: a well- known provider of stock photography and stock footage(video). These videos include a variety of backgrounds, and the subjects represent different genders and ages, reflecting a diverse range of scenarios. The input videos have been pre-processed by cropping them to a length of approximately 4 to 5 seconds and resizing them to 720p resolution, ensuring a consistent and uniform format across the dataset. Deepfake videos were generated using Remaker AI employing face-swapping techniques. Remaker AI is an AI-powered platform that can generate images, swap faces in photos and videos, and edit content. The source face photos for these swaps were taken from Freepik: is an image bank website provides contents such as photographs, illustrations and vector images. SDFVD was created due to the lack of availability of any such comparable small-scale deepfake video datasets. Key benefits of such datasets are: • In educational settings or smaller research labs, smaller datasets can be particularly useful as they require fewer resources, allowing students and researchers to conduct experiments with limited budgets and computational resources. • Researchers can use small-scale datasets to quickly prototype new ideas, test concepts, and refine algorithms before scaling up to larger datasets. Overall, SDFVD offers a compact but diverse collection of real and deepfake videos, suitable for a variety of applications, including research, security, and education. It serves as a valuable resource for exploring the rapidly evolving field of deepfake technology and its impact on society.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual total students amount from 2003 to 2023 for Small Middle School
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adapted from: [https://www.kaggle.com/datasets/csmalarkodi/covid-fake-news-dataset].
Facebook
TwitterThis dataset contains 1000000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.
Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is small, clean, and structured to be beginner-friendly.
Random noise simulates differences in learning ability, motivation, etc.
Regression Tasks
total_score from weekly_self_study_hours. attendance_percentage and class_participation. Classification Tasks
grade (A–F) using study hours, attendance, and participation. Model Evaluation Practice
✅ This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).