https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/MXM0Q2
In the publication [1] we implemented anonymization and synthetization techniques for a structured data set collected during the HiGHmed Use Case Cardiology study [2]. We employed the data anonymization tool ARX [3] and the data synthetization framework ASyH [4], individually and in combination, and evaluated the utility and shortcomings of the different approaches through statistical analyses and privacy risk assessments. Data utility was assessed by computing two heart failure risk scores (Barcelona BioHF [5] and MAGGIC [6]) on the protected data sets. We observed only minimal deviations from the scores computed on the original data set. Additionally, we performed a re-identification risk analysis and found only minor residual risks for common types of privacy threats. We could thus demonstrate that anonymization and synthetization methods protect privacy while retaining data utility for heart failure risk assessment. Both approaches, and a combination thereof, introduce only minimal deviations from the original data set across all features. While data synthesis techniques can produce any number of new records, data anonymization techniques offer more formal privacy guarantees. Consequently, data synthesis on anonymized data further enhances privacy protection with little impact on data utility. We hereby share all generated data sets with the scientific community through a use and access agreement.

[1] Johann TI, Otte K, Prasser F, Dieterich C. Anonymize or synthesize? Privacy-preserving methods for heart failure score analytics. Eur Heart J Digit Health 2024. doi:10.1093/ehjdh/ztae083
[2] Sommer KK, Amr A, Bavendiek, Beierle F, Brunecker P, Dathe H, et al. Structured, harmonized, and interoperable integration of clinical routine data to compute heart failure risk scores. Life (Basel) 2022;12:749.
[3] Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX—current status and challenges ahead. Softw Pract Exper 2020;50:1277–1304.
[4] Johann TI, Wilhelmi H. ASyH—anonymous synthesizer for health data, GitHub, 2023. Available at: https://github.com/dieterich-lab/ASyH.
[5] Lupón J, de Antonio M, Vila J, Peñafiel J, Galán A, Zamora E, et al. Development of a novel heart failure risk tool: the Barcelona bio-heart failure risk calculator (BCN Bio-HF calculator). PLoS One 2014;9:e85466.
[6] Pocock SJ, Ariti CA, McMurray JJV, Maggioni A, Køber L, Squire IB, et al. Predicting survival in heart failure: a risk score based on 39 372 patients from 30 studies. Eur Heart J 2013;34:1404–1413.
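As a rough illustration of the kind of re-identification risk analysis mentioned above (the record values and quasi-identifier choice are invented for this sketch, and this is not the ARX implementation), a prosecutor-style risk can be estimated from equivalence-class sizes over the quasi-identifiers:

```python
from collections import Counter

def reidentification_risks(records, quasi_identifiers):
    """Risk per record = 1 / size of its equivalence class over the quasi-identifiers."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]

# Toy records with illustrative quasi-identifiers (not from the actual data set).
records = [
    {"age_group": "60-70", "sex": "M", "nyha": "II"},
    {"age_group": "60-70", "sex": "M", "nyha": "II"},
    {"age_group": "70-80", "sex": "F", "nyha": "III"},
]
risks = reidentification_risks(records, ["age_group", "sex"])
print(max(risks))  # the unique record carries the maximal risk of 1.0
```

Records sharing the same quasi-identifier combination form larger equivalence classes and thus carry lower individual risk.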
The data published in this record was adopted in the following study: Exploring Higher Education students' experience with AI-powered educational tools: The case of an Early Warning System. The study analyses the students' experience of an early warning system developed at a fully online university. It is based on 21 semi-structured interviews that yielded a corpus of 21,761 words, to which a mixed inductive and deductive codification approach was applied after thematic analysis. We focused on 11 themes, 52 subthemes, and 396 coded segments to perform content analysis. Our findings revealed that the students, primarily senior workers with high academic self-efficacy, had little experience with this type of system and low expectations about it. However, a usage experience triggered interest and meaningful reflections on the tool. Nevertheless, a comparative analysis between disciplines related to Computer Science and Economics showed higher confidence in and expectations about the system and artificial intelligence overall in the first group. These results highlight the relevance of supporting students' further experiences with and understanding of artificial intelligence systems in education so that they accept them and, especially, participate in iterative development processes of such tools to achieve quality, relevance, and fairness. The records attached as part of the dataset include: 1- The General CodeTree with exemplar coding excerpts in Spanish
2- Extract of transcriptions in English
3- Full Report in Spanish as extracted from NVIVO, including the extracted codes for the synthesis (1,2) in blue, and the comments made by the two researchers engaged in the interrater agreement.
4- General Content Analysis (Spreadsheet ODS)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Anonymized dataset for MD Anderson Summer Student Stats Course.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset introduces the anonymised Open University Learning Analytics Dataset (OULAD). It contains data about courses, students, and their interactions with the Virtual Learning Environment (VLE) for seven selected courses (called modules). Presentations of courses start in February and October; they are marked by "B" and "J", respectively. The dataset consists of tables connected by unique identifiers. All tables are stored in CSV format.
courses.csv This file contains the list of all available modules and their presentations. The columns are:
- code_module – code name of the module, which serves as the identifier.
- code_presentation – code name of the presentation. It consists of the year and "B" for the presentation starting in February or "J" for the presentation starting in October.
- length – length of the module-presentation in days.
The structure of B and J presentations may differ, and it is therefore good practice to analyse them separately. Nevertheless, for some presentations the corresponding previous B/J presentation does not exist, and the J presentation must then be used to inform the B presentation or vice versa. In the dataset this is the case for the CCC, EEE and GGG modules.
assessments.csv This file contains information about assessments in module-presentations. Usually, every presentation has a number of assessments followed by the final exam. The CSV contains the following columns:
vle.csv This file contains information about the available materials in the VLE. Typically these are HTML pages, PDF files, etc. Students have access to these materials online, and their interactions with them are recorded. The vle.csv file contains the following columns:
studentInfo.csv This file contains demographic information about the students together with their results. The file contains the following columns:
studentRegistration.csv This file contains information about the time when the student registered for the module presentation. For students who unregistered, the date of unregistration is also recorded. The file contains five columns:
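As a minimal sketch of how these tables link (the rows below are invented; only the column names given in the description are assumed), the shared identifiers code_module and code_presentation join courses.csv to the student-level tables:

```python
import pandas as pd

# Toy stand-ins for courses.csv and studentInfo.csv with illustrative values.
courses = pd.DataFrame({
    "code_module": ["AAA", "BBB"],
    "code_presentation": ["2013J", "2014B"],
    "length": [268, 240],
})
student_info = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2014B"],
    "id_student": [11391, 28400, 30268],
})

# Attach the presentation length to every student record via the shared keys.
merged = student_info.merge(courses, on=["code_module", "code_presentation"], how="left")
print(merged.shape)  # (3, 4)
```

The same key pair links assessments.csv, vle.csv, and studentRegistration.csv to courses.csv.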
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work investigates the trade-off between data anonymization and utility, particularly focusing on the implications for equity-related research in education. Using microdata from the 2019 Brazilian National Student Performance Exam (ENADE), the study applies the (ε, δ)-Differential Privacy model to explore the impact of anonymization on the dataset’s utility for socio-educational equity analysis. By clustering both the original and anonymized datasets, the research evaluates how group categories related to students’ sociodemographic variables, such as gender, race, income, and parental education, are affected by the anonymization process. The results reveal that while anonymization techniques can preserve overall data structure, they can also lead to the suppression or misrepresentation of minority groups, introducing biases that may jeopardise the promotion of educational equity. This finding highlights the importance of involving domain experts in the interpretation of anonymized data, particularly in studies aimed at reducing socio-economic inequalities. The study concludes that careful attention is needed to prevent anonymization efforts from distorting key group categories, which could undermine the validity of data-driven policies aimed at promoting equity.
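A minimal sketch of how an (ε, δ)-differentially private release of group counts might look (the classical Gaussian mechanism, valid for ε < 1; the group names and counts are illustrative, not drawn from ENADE):

```python
import math
import numpy as np

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism for (epsilon, delta)-DP."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

def noisy_counts(counts, epsilon, delta, rng):
    # Counting queries have sensitivity 1: one person changes a count by at most 1.
    sigma = gaussian_sigma(sensitivity=1.0, epsilon=epsilon, delta=delta)
    return {group: c + rng.normal(0.0, sigma) for group, c in counts.items()}

rng = np.random.default_rng(0)
released = noisy_counts({"group_a": 1200, "group_b": 35}, epsilon=0.5, delta=1e-5, rng=rng)
# The same absolute noise hits both groups, so the small minority group
# is distorted far more in relative terms - the effect the study highlights.
```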
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Title: Survey of International Students
Creator: Fatema Islam Meem, Imran Hussain Mahdy
Subject: International Students; Academic and Social Integration; University of Idaho
Description: This dataset contains responses from a survey conducted to understand the factors affecting the academic success and social integration of international students at the University of Idaho.
Contributor: University of Idaho
Date: February 10, 2024
Type: Dataset
Format: CSV
Source: Google Form
Language: English
Coverage: University of Idaho, [2014-2024]
Sources: The primary data source was a Google Form survey designed to capture international students' perspectives on their integration into the academic and social fabric of the university. Questions were developed to explore academic challenges, social integration, support systems, and overall satisfaction with their university experience.
Collection Methodology: The survey was distributed with the assistance of the International Programs Office (IPO) at the University of Idaho to ensure a broad reach among the target demographic. Efforts were made to design the survey questions to be clear, concise, and sensitive to the cultural diversity of the respondents. The collection process faced challenges, particularly in achieving a high response rate. Despite these obstacles, 48 responses were obtained, providing valuable insights into the experiences of international students. The survey data were anonymized to protect respondents' privacy and maintain data integrity.
This unique dataset, collected via a May 2025 survey, captures how 496 Indian college students use AI tools (e.g., ChatGPT, Gemini, Copilot) in academics. It includes 16 attributes, such as AI tool usage, trust, impact on grades, and internet access, making it well suited to education analytics and machine learning.
According to our latest research, the global market size for Anonymization Tools for Traffic Data reached USD 1.12 billion in 2024, reflecting robust adoption across various sectors. The market is projected to expand at a CAGR of 14.6% during the forecast period, reaching a value of USD 3.48 billion by 2033. This impressive growth is primarily driven by the increasing need for privacy-compliant data sharing and analysis in smart mobility and urban infrastructure, as well as stringent data protection regulations worldwide.
The surge in demand for Anonymization Tools for Traffic Data is fundamentally fueled by the exponential growth in data generated by intelligent transportation systems, connected vehicles, and urban mobility solutions. As cities embrace smart technologies to enhance traffic flow, reduce congestion, and improve public safety, the volume of sensitive traffic data collected from various sources such as sensors, cameras, and mobile devices has soared. However, this data often contains personally identifiable information (PII), raising significant privacy concerns. The implementation of robust anonymization tools has become a necessity for organizations to comply with regulations like GDPR, CCPA, and other regional data protection laws. These tools ensure that sensitive information is effectively masked or de-identified, enabling data-driven insights without compromising individual privacy, which in turn fuels market growth.
Another critical growth factor is the increasing collaboration between public and private entities to foster innovation in mobility analytics and urban planning. Governments, transportation authorities, and research organizations are leveraging anonymized traffic data to develop predictive models, optimize public transit routes, and design safer urban environments. The ability to securely share and analyze large volumes of traffic data without exposing personal information is central to these initiatives. Furthermore, advancements in artificial intelligence and machine learning have enhanced the capabilities of anonymization tools, allowing for more sophisticated data transformation techniques that maintain data utility while ensuring compliance. This technological evolution is propelling the adoption of anonymization solutions across diverse end-user segments.
The proliferation of smart city projects and the integration of Internet of Things (IoT) devices in transportation infrastructure are also significant drivers for the Anonymization Tools for Traffic Data Market. As urban centers worldwide invest in real-time traffic monitoring, autonomous vehicles, and multimodal mobility platforms, the complexity and sensitivity of traffic data continue to increase. Anonymization tools have become indispensable in enabling secure data exchange among stakeholders, facilitating cross-sector collaboration, and supporting data monetization strategies. Additionally, growing public awareness around digital privacy and the reputational risks associated with data breaches are prompting organizations to prioritize data anonymization as a core component of their digital strategy.
The advent of the Vehicle Data Anonymization Platform is revolutionizing how sensitive vehicle information is managed and utilized in the transportation sector. As connected vehicles become more prevalent, the data they generate is invaluable for enhancing traffic management, improving safety, and optimizing vehicle performance. However, this data often includes personal information that must be protected to comply with privacy regulations. A Vehicle Data Anonymization Platform provides a robust solution by ensuring that data is anonymized before it is shared or analyzed, thus preserving privacy while still allowing for valuable insights to be derived. This platform is crucial for enabling secure data exchange between automotive manufacturers, service providers, and urban planners, fostering innovation and collaboration across the mobility ecosystem.
From a regional perspective, North America currently leads the Anonymization Tools for Traffic Data Market, accounting for the largest share in 2024. This dominance is attributed to early adoption of advanced traffic management systems, a mature regulatory landscape, and significant investments in smart
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes anonymized records of student activities in a peer review activity.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Wi-Fi Probe Data Anonymization Services market size reached USD 1.43 billion in 2024, reflecting robust demand driven by heightened privacy regulations and the exponential growth of smart infrastructure. The market is expected to expand at a CAGR of 17.2% from 2025 to 2033, with the forecasted market size reaching USD 5.17 billion by 2033. This growth is primarily fueled by increasing concerns regarding data privacy, the proliferation of Wi-Fi-enabled devices, and the necessity for organizations to comply with global data protection standards.
One of the most significant growth factors for the Wi-Fi Probe Data Anonymization Services market is the intensifying regulatory landscape surrounding data privacy. With the enforcement of stringent data protection frameworks such as the General Data Protection Regulation (GDPR) in Europe, the California Consumer Privacy Act (CCPA), and similar regulations emerging in Asia and Latin America, organizations are under immense pressure to ensure that data collected from Wi-Fi probes is anonymized and compliant. These regulations mandate that personally identifiable information (PII) is protected, creating a strong incentive for enterprises to invest in advanced anonymization services. Additionally, the increasing frequency and sophistication of cyber threats have underscored the importance of robust data anonymization protocols, further propelling market demand.
Another critical driver is the rapid expansion of smart city initiatives and the widespread adoption of IoT devices. Urban centers across the globe are leveraging Wi-Fi probe technology to collect valuable data on foot traffic, mobility patterns, and public safety. However, the collection of such granular data raises significant privacy concerns, necessitating the deployment of effective anonymization solutions. The ability of Wi-Fi probe data anonymization services to mask or encrypt sensitive information without compromising the utility of the data is a key value proposition, enabling municipalities and private enterprises to harness actionable insights while maintaining compliance with privacy laws. This trend is expected to accelerate as cities continue to digitize their infrastructure and prioritize citizen privacy.
The growing integration of Wi-Fi analytics in sectors such as retail, hospitality, and transportation is also contributing to market expansion. Retailers, for instance, use Wi-Fi probe data to analyze customer behavior, optimize store layouts, and enhance personalized marketing. Similarly, transportation hubs rely on this data to manage passenger flow and improve operational efficiency. However, as consumers become increasingly aware of how their data is used, there is mounting pressure on businesses to employ anonymization services that safeguard user identities. The competitive advantage gained by maintaining consumer trust and adhering to best practices in data privacy is becoming a decisive factor for organizations across these verticals.
Regionally, North America remains at the forefront of the Wi-Fi Probe Data Anonymization Services market, accounting for the largest revenue share in 2024, followed closely by Europe and the Asia Pacific. The dominance of North America can be attributed to a combination of advanced technological infrastructure, early adoption of privacy-centric solutions, and a mature regulatory environment. Meanwhile, the Asia Pacific region is anticipated to witness the fastest growth over the forecast period, driven by rapid urbanization, expanding digital economies, and increasing investments in smart city projects. Latin America and the Middle East & Africa are also expected to demonstrate steady growth as regulatory frameworks strengthen and digital transformation initiatives gain momentum.
The service type segment of the Wi-Fi Probe Data Anonymization Services market encompasses data masking, data tokenization, data encryption, and other specialized anonymization techniques. Data masking remains a foundational approach, enabling organizations to obfuscate sensitive information such as MAC addresses and device identifiers while preserving the analytical value of the data. This technique is widely adopted in industries where real-time analytics are crucial, such as retail and transportation. The demand for data masking is expec
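As a hedged sketch of one common masking approach for device identifiers (keyed pseudonymization; the key handling and function names are illustrative, not a specific vendor's implementation):

```python
import hashlib
import hmac

# Illustrative secret; real deployments manage and rotate keys securely.
SECRET_KEY = b"rotate-me-regularly"

def mask_mac(mac: str) -> str:
    """Keyed pseudonymization of a MAC address: stable for analytics,
    not trivially reversible without the key. Truncation reduces linkability."""
    digest = hmac.new(SECRET_KEY, mac.lower().encode(), hashlib.sha256).hexdigest()
    return digest[:16]

a = mask_mac("AA:BB:CC:DD:EE:01")
b = mask_mac("aa:bb:cc:dd:ee:01")
print(a == b)  # normalization makes the same device map to the same pseudonym
```

A plain unkeyed hash would be vulnerable to enumeration of the MAC address space, which is why the keyed variant is sketched here.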
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains anonymized student data from a research study exploring the effects of different educational interventions on students in STEM (Science, Technology, Engineering, and Mathematics) courses across various universities. The study involved different conditions: an AI-tool-only intervention, a mindset-training-only intervention, a combined AI and mindset intervention, and a control group.
The data encompasses a rich set of variables, including:
- Demographics: Age, gender, ethnicity.
- Academic Background: University, course type (e.g., Computer Science, Mathematics, Biology, Physics, Engineering), and ACT scores.
- Psychological Measures: Baseline, mid-intervention, and final measures of intrinsic motivation, STEM anxiety, classroom comfort, and growth beliefs.
- Intervention-Specific Perceptions: For relevant groups, perceptions of AI usefulness and ease of use.
- Perceived Support: Perceptions of instructor universal beliefs, peer support, and faculty support.
- Academic Outcomes: Course grade, final exam score, assignment average, class attendance rate, and study hours per week.

This dataset is valuable for researchers and data scientists interested in educational psychology, learning analytics, the impact of AI in education, mindset interventions, and factors influencing student success in STEM fields. It allows for the exploration of how different interventions might affect student motivation, anxiety, beliefs, and academic performance over time.
Comprehensive High School Student Performance Dataset (2,000 Records)
This rich dataset contains 2,000 anonymized student records designed for in-depth analysis of academic success factors. It includes essential demographic information and performance scores across 13 core high school subjects.
Key Data Points:
Demographics: Includes Gender, Date of Birth, Grade, and Class.
Academic Performance: Detailed scores (out of 20 points each) for subjects ranging from Algebra II to Pre-Calculus and Computer Science.
Outcome Metrics: Includes Total Score and the final Result (Pass/Fail).
**Goal/Inspiration:** The dataset is intended to be a strong benchmark for machine learning projects focused on:
Predicting a student's final Result based on their demographic profile and initial subject scores.
Identifying which subjects are the strongest predictors of overall academic success.
Analyzing performance trends across different Grades and Classes.
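A minimal sketch of deriving the Result outcome from the subject scores (the pass threshold below is an assumption for illustration; the dataset description does not state one):

```python
# Per the description: 13 core subjects, each scored out of 20 points,
# with a Total Score and a final Pass/Fail Result.
SUBJECT_COUNT = 13
MAX_PER_SUBJECT = 20
PASS_RATIO = 0.5  # assumed threshold, not taken from the dataset

def result_from_scores(scores):
    assert len(scores) == SUBJECT_COUNT
    total = sum(scores)
    return "Pass" if total >= PASS_RATIO * SUBJECT_COUNT * MAX_PER_SUBJECT else "Fail"

print(result_from_scores([14] * 13))  # "Pass": total 182 of 260
```

A classifier trained on the real records would learn the actual decision boundary instead of this assumed rule.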
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains anonymized written reflections from 42 students enrolled in a course on academic integrity at Stockholm University. The data were originally collected on 6 September 2024 during a seminar on ethical aspects of using generative AI in academic work. Students were asked to respond to the question: “Should I mention that I used ChatGPT to complete an academic assignment?”
The reflections, totaling 2,626 words, were first written in Swedish and subsequently translated into English for research purposes. The dataset includes:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study reports on in-depth research into student learning using a "thinking-first" framework combined with stepwise heuristics to provide students with structure throughout the entire programming process.
The study targeted secondary education students in an elective computer science course. There was one class of 11 Dutch high school students, 2 female and 9 male. The group was heterogeneous, with students from different academic levels and age groups. Each student's level and previous experience with CS was determined a priori using a pretest.
For this study we developed sets of quizzes, tasks, and tests comprising code comprehension and code composition questions (including reading and creating flowchart designs). The student responses to each were analyzed.
This repository contains the following data:
- taxonomyPerQ.pdf: indicates the taxonomy level of each (quiz, task, test) question answered by students
- assessments_unanswered: all questions (quizzes, tasks, tests) administered to students
- pretask (responses): anonymized student responses to pretask questions
- midtask (responses): anonymized student responses to midtask questions
- finaltask (responses): anonymized student responses to finaltask questions
- quiz 1 (responses): anonymized student responses to quiz 1 questions
- quiz 2 (responses): anonymized student responses to quiz 2 questions
- final test (responses): anonymized student responses to final test questions
The students' handwritten work was scanned, saved as PDF, and coded in ATLAS.ti. These coded PDFs can no longer be anonymized and are therefore not openly distributed or published.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes all anonymized quantitative student-level demographic data, and a codebook explaining the variables contained in the dataset.
https://creativecommons.org/publicdomain/zero/1.0/
The University Academic Misconduct Detector Dataset is designed to support research and development of AI-powered tools for identifying potential academic dishonesty in higher education institutions. With the rise of online and hybrid learning environments, universities face growing challenges in detecting plagiarism, contract cheating, and other forms of misconduct. This dataset provides a clean, well-structured collection of academic records, behavior logs, and submission patterns to enable data-driven solutions for academic integrity.
The dataset aims to help:
Researchers develop advanced machine learning models for detecting anomalies in student work.
Educators identify early warning signs of potential misconduct and provide timely interventions.
Universities create automated monitoring systems to flag suspicious activity in large cohorts.
Data Scientists explore classification, clustering, and anomaly detection techniques in an educational setting.
The dataset contains the following key components:
Student Metadata: Basic anonymized demographic details (e.g., age group, department, study year).
Assignment & Exam Submissions: Scores, submission times, and similarity index from plagiarism detection tools.
Behavioral Indicators: Login frequency, resource usage, and last-minute submission patterns.
Misconduct Labels: Ground truth annotations indicating whether a case was flagged as misconduct or not.
All data has been anonymized to ensure privacy and compliance with ethical guidelines.
Binary Classification: Predict whether a student’s submission involves misconduct.
Anomaly Detection: Spot unusual submission patterns compared to the class average.
Feature Engineering Practice: Extract meaningful features from behavioral logs for predictive modeling.
Explainable AI Research: Explore interpretable models for sensitive decision-making in education.
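The anomaly-detection task above can be sketched as follows (the lead-time feature and threshold are assumptions for illustration, not fields guaranteed by the dataset): flag submissions whose time before the deadline deviates strongly from the class average.

```python
import statistics

def flag_anomalies(lead_times_hours, threshold=2.0):
    """Flag submissions whose lead time is more than `threshold` sample
    standard deviations away from the class mean (a simple z-score rule)."""
    mean = statistics.mean(lead_times_hours)
    stdev = statistics.stdev(lead_times_hours)
    return [abs(t - mean) / stdev > threshold for t in lead_times_hours]

# Illustrative class: ten typical submissions and one last-minute outlier.
lead_times = [30, 28, 31, 29, 30, 32, 27, 30, 29, 31, 0.2]
flags = flag_anomalies(lead_times)
```

In practice, robust statistics (e.g., median and MAD) or multivariate methods over several behavioral indicators would be less sensitive to the outliers they are meant to detect.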
Academic misconduct is a growing global concern that threatens the credibility of higher education. By providing this dataset, the goal is to foster transparent, ethical, and AI-assisted solutions that help maintain fairness and academic excellence.
Data is synthetic/anonymized and does not represent real individuals.
Intended for educational and research purposes only.
Any deployment of models trained on this dataset should follow strict institutional ethics policies.
One of the key impacts of AMI technology is the availability of interval energy usage data, which can support the development of new products and services and enable the market to deliver greater value to customers. Requestors can now access anonymized interval energy usage data in 30-minute intervals for all zip codes where AMI meters have been deployed.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The early identification of students facing learning difficulties is one of the most critical challenges in modern education. Intervening effectively requires leveraging data to understand the complex interplay between student demographics, engagement patterns, and academic performance.
This dataset was created to serve as a high-quality, pre-processed resource for building machine learning models to tackle this very problem. It is a unique hybrid dataset, meticulously crafted by unifying three distinct sources:
The Open University Learning Analytics Dataset (OULAD): A rich dataset detailing student interactions with a Virtual Learning Environment (VLE). We have aggregated the raw, granular data (over 10 million interaction logs) into powerful features, such as total clicks, average assessment scores, and distinct days of activity for each student registration.
The UCI Student Performance Dataset: A classic educational dataset containing demographic information and final grades in Portuguese and Math subjects from two Portuguese schools.
A Synthetic Data Component: A synthetically generated portion of the data, created to balance the dataset or represent specific student profiles.
A direct merge of these sources was not possible as the student identifiers were not shared. Instead, a strategy of intelligent concatenation was employed. The final dataset has undergone a rigorous pre-processing pipeline to make it immediately usable for machine learning tasks:
Advanced Imputation: Missing values were handled using a sophisticated iterative imputation method powered by Gaussian Mixture Models (GMM), ensuring the dataset's integrity.
One-Hot Encoding: All categorical features have been converted to a numerical format.
Feature Scaling: All numerical features have been standardized (using StandardScaler) to have a mean of 0 and a standard deviation of 1, preventing model bias from features with different scales.
The result is a clean, comprehensive dataset ready for modeling.
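The pipeline described above can be sketched as follows (a simplified stand-in: mean imputation replaces the GMM-based iterative imputation, and the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged OULAD/UCI/synthetic data.
df = pd.DataFrame({
    "total_clicks": [120.0, np.nan, 340.0, 80.0],
    "avg_score": [55.0, 70.0, np.nan, 62.0],
    "gender": ["M", "F", "F", "M"],
})

# 1. Imputation (mean imputation as a simple stand-in for GMM-based iterative imputation).
numeric = df.select_dtypes("number")
numeric = numeric.fillna(numeric.mean())

# 2. One-hot encoding of categorical features.
encoded = pd.get_dummies(df[["gender"]], prefix="gender")

# 3. Standardization to zero mean and unit standard deviation.
scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

processed = pd.concat([scaled, encoded], axis=1)
```

With scikit-learn available, the same steps map onto a Pipeline combining an imputer, OneHotEncoder, and StandardScaler.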
Each row represents a student profile, and the columns are the features and the target.
Features include aggregated online engagement metrics (e.g., clicks, distinct activities), academic performance (grades, scores), and student demographics (e.g., gender, age band). A key feature indicates the original data source (OULAD, UCI, Synthetic).
The dataset contains no Personally Identifiable Information (PII). Demographic information is presented in broad, anonymized categories.
Key Columns:
Target Variable:
had_difficulty: The primary target for classification. This binary variable has been engineered from the original final_result column of the OULAD dataset.
1: The student either failed (Fail) or withdrew (Withdrawn) from the course.
0: The student passed (Pass or Distinction).
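The target engineering above can be sketched directly from the final_result values:

```python
import pandas as pd

# had_difficulty = 1 for Fail or Withdrawn, 0 for Pass or Distinction.
final_result = pd.Series(["Pass", "Withdrawn", "Distinction", "Fail"])
had_difficulty = final_result.isin(["Fail", "Withdrawn"]).astype(int)
print(had_difficulty.tolist())  # [0, 1, 0, 1]
```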
Feature Groups:
OULAD Aggregated Features (e.g., oulad_total_cliques, oulad_media_notas): Quantitative metrics summarizing a student's engagement and performance within the VLE.
Academic Performance Features (e.g., nota_matematica_harmonizada): Harmonized grades from different data sources.
Demographic Features (e.g., gender_*, age_band_*): One-hot encoded columns representing student demographics.
Origin Features (e.g., origem_dado_OULAD, origem_dado_UCI): One-hot encoded columns indicating the original source of the data for each row. This allows for source-specific analysis.
(Note: All numerical feature names are post-scaling and may not directly reflect their original names. Please refer to the complete column list for details.)
This dataset would not be possible without the original data providers. Please acknowledge them in any work that uses this data:
OULAD Dataset: Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4. https://analyse.kmi.open.ac.uk/open_dataset
UCI Student Performance Dataset: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS. https://archive.ics.uci.edu/ml/datasets/student+performance
This dataset is perfect for a variety of predictive modeling tasks. Here are a few ideas to get you started:
Can you build a classification model to predict had_difficulty with high recall? (Minimizing the number of at-risk students we fail to identify).
Which features are the most powerful predictors of student failure or withdrawal? (Feature Importance Analysis).
Can you build separate models for each data origin (origem_dado_*) and compare ...
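For the first task, a minimal scikit-learn sketch is shown below. The synthetic features stand in for the dataset's scaled numerical columns; everything here (model choice, threshold value) is an assumption, not a prescribed baseline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled feature matrix; class 1 = had_difficulty.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Lowering the decision threshold below the default 0.5 trades precision
# for recall, i.e. fewer at-risk students are missed.
probs = clf.predict_proba(X_te)[:, 1]
recall_default = recall_score(y_te, (probs >= 0.5).astype(int))
recall_low_thr = recall_score(y_te, (probs >= 0.3).astype(int))
print(recall_default, recall_low_thr)
```

Threshold tuning is only one lever; class weighting (`class_weight="balanced"`) or resampling are common alternatives when recall on the at-risk class is the priority.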
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset card for Text Anonymization Benchmark (TAB) train
Dataset Summary
This is the training split of the Text Anonymisation Benchmark (TAB). As the title suggests, it is a dataset focused on text anonymisation, specifically European Court documents, which carry labels from multiple annotators.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed] See the full description on the dataset page: https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-train.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
A cross-sectional, mixed-methods online survey was deployed to students at five universities in the NE. The survey explored whether students changed their behaviour at university, what they changed, and why. Their engagement in physical activity, smoking, diet (consumption of fruit and vegetables) and alcohol consumption was assessed.
Custom license (https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/MXM0Q2)
In the publication [1] we implemented anonymization and synthetization techniques for a structured data set collected during the HiGHmed Use Case Cardiology study [2]. We employed the data anonymization tool ARX [3] and the data synthetization framework ASyH [4], individually and in combination, and evaluated the utility and shortcomings of the different approaches through statistical analyses and privacy risk assessments. Data utility was assessed by computing two heart failure risk scores (Barcelona BioHF [5] and MAGGIC [6]) on the protected data sets; we observed only minimal deviations from the scores obtained on the original data set. Additionally, we performed a re-identification risk analysis and found only minor residual risks for common types of privacy threats. We could thus demonstrate that anonymization and synthetization methods protect privacy while retaining data utility for heart failure risk assessment. Both approaches, and a combination thereof, introduce only minimal deviations from the original data set across all features. While data synthesis techniques can produce any number of new records, data anonymization techniques offer more formal privacy guarantees. Consequently, data synthesis on anonymized data further enhances privacy protection with little impact on data utility. We hereby share all generated data sets with the scientific community through a use and access agreement.
[1] Johann TI, Otte K, Prasser F, Dieterich C. Anonymize or synthesize? Privacy-preserving methods for heart failure score analytics. Eur Heart J 2024. doi:10.1093/ehjdh/ztae083
[2] Sommer KK, Amr A, Bavendiek, Beierle F, Brunecker P, Dathe H, et al. Structured, harmonized, and interoperable integration of clinical routine data to compute heart failure risk scores. Life (Basel) 2022;12:749.
[3] Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX—current status and challenges ahead. Softw Pract Exper 2020;50:1277–1304.
[4] Johann TI, Wilhelmi H. ASyH—anonymous synthesizer for health data. GitHub, 2023. Available at: https://github.com/dieterich-lab/ASyH.
[5] Lupón J, de Antonio M, Vila J, Peñafiel J, Galán A, Zamora E, et al. Development of a novel heart failure risk tool: the Barcelona bio-heart failure risk calculator (BCN Bio-HF calculator). PLoS One 2014;9:e85466.
[6] Pocock SJ, Ariti CA, McMurray JJV, Maggioni A, Køber L, Squire IB, et al. Predicting survival in heart failure: a risk score based on 39 372 patients from 30 studies. Eur Heart J 2013;34:1404–1413.
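The formal guarantee referenced above can be made concrete with k-anonymity: every combination of quasi-identifiers must occur at least k times in the released data. The check below is a generic pandas illustration of that property (the function name and sample records are ours, and this is not the ARX implementation):

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return the smallest equivalence-class size over the quasi-identifiers."""
    return int(df.groupby(quasi_identifiers).size().min())

# Toy released table with two quasi-identifying attributes.
records = pd.DataFrame({
    "age_band": ["40-49", "40-49", "50-59", "50-59"],
    "sex":      ["F", "F", "M", "M"],
})
print(k_anonymity(records, ["age_band", "sex"]))  # 2
```

A data set satisfies k-anonymity for a given k when this value is at least k; synthesizing records on top of such a table, as described above, adds a further layer of protection.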