https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/MXM0Q2
In the publication [1] we implemented anonymization and synthetization techniques for a structured data set collected during the HiGHmed Use Case Cardiology study [2]. We employed the data anonymization tool ARX [3] and the data synthetization framework ASyH [4], individually and in combination, and evaluated the utility and shortcomings of the different approaches through statistical analyses and privacy risk assessments. Data utility was assessed by computing two heart failure risk scores (Barcelona BioHF [5] and MAGGIC [6]) on the protected data sets. We observed only minimal deviations from the scores computed on the original data set. Additionally, we performed a re-identification risk analysis and found only minor residual risks for common types of privacy threats. We could thus demonstrate that anonymization and synthetization methods protect privacy while retaining data utility for heart failure risk assessment. Both approaches, and a combination thereof, introduce only minimal deviations from the original data set across all features. While data synthesis techniques can produce any number of new records, data anonymization techniques offer more formal privacy guarantees. Consequently, data synthesis on anonymized data further enhances privacy protection with little impact on data utility. We hereby share all generated data sets with the scientific community through a use and access agreement.

[1] Johann TI, Otte K, Prasser F, Dieterich C. Anonymize or synthesize? Privacy-preserving methods for heart failure score analytics. Eur Heart J Digit Health 2024. doi:10.1093/ehjdh/ztae083
[2] Sommer KK, Amr A, Bavendiek, Beierle F, Brunecker P, Dathe H, et al. Structured, harmonized, and interoperable integration of clinical routine data to compute heart failure risk scores. Life (Basel) 2022;12:749.
[3] Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX—current status and challenges ahead. Softw Pract Exper 2020;50:1277–1304.
[4] Johann TI, Wilhelmi H. ASyH—anonymous synthesizer for health data, GitHub, 2023. Available at: https://github.com/dieterich-lab/ASyH.
[5] Lupón J, de Antonio M, Vila J, Peñafiel J, Galán A, Zamora E, et al. Development of a novel heart failure risk tool: the Barcelona bio-heart failure risk calculator (BCN Bio-HF calculator). PLoS One 2014;9:e85466.
[6] Pocock SJ, Ariti CA, McMurray JJV, Maggioni A, Køber L, Squire IB, et al. Predicting survival in heart failure: a risk score based on 39 372 patients from 30 studies. Eur Heart J 2013;34:1404–1413.
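As a rough illustration of the kind of re-identification risk analysis mentioned above (the record values and quasi-identifier choice are invented for this sketch, and this is not the ARX implementation), a prosecutor-style risk can be estimated from equivalence-class sizes over the quasi-identifiers:

```python
from collections import Counter

def reidentification_risks(records, quasi_identifiers):
    """Risk per record = 1 / size of its equivalence class over the quasi-identifiers."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]

# Toy records with illustrative quasi-identifiers (not from the actual data set).
records = [
    {"age_group": "60-70", "sex": "M", "nyha": "II"},
    {"age_group": "60-70", "sex": "M", "nyha": "II"},
    {"age_group": "70-80", "sex": "F", "nyha": "III"},
]
risks = reidentification_risks(records, ["age_group", "sex"])
print(max(risks))  # the unique record carries the maximal risk of 1.0
```

Records sharing the same quasi-identifier combination form larger equivalence classes and thus carry lower individual risk.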
The data published in this record was adopted in the following study: Exploring Higher Education students' experience with AI-powered educational tools: The case of an Early Warning System. The study analyses the students' experience of an early warning system developed at a fully online university. It is based on 21 semi-structured interviews that yielded a corpus of 21,761 words, to which a mixed inductive and deductive codification approach was applied after thematic analysis. We focused on 11 themes, 52 subthemes, and 396 coded segments to perform content analysis. Our findings revealed that the students, primarily senior workers with high academic self-efficacy, had little experience with this type of system and low expectations about it. However, a usage experience triggered interest and meaningful reflections on the tool. Nevertheless, a comparative analysis between disciplines related to Computer Science and Economics showed higher confidence in and expectations about the system and artificial intelligence overall in the first group. These results highlight the relevance of supporting students' further experiences with and understanding of artificial intelligence systems in education so that they accept them and, especially, participate in iterative development processes of such tools to achieve quality, relevance, and fairness. The records attached as part of the dataset include: 1- The General CodeTree with exemplar coding excerpts in Spanish
2- Extract of transcriptions in English
3- Full Report in Spanish as extracted from NVIVO, including the extracted codes for the synthesis (1,2) in blue, and the comments made by the two researchers engaged in the interrater agreement.
4- General Content Analysis (Spreadsheet ODS)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Anonymized dataset for MD Anderson Summer Student Stats Course.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset introduces the anonymised Open University Learning Analytics Dataset (OULAD). It contains data about courses, students, and their interactions with the Virtual Learning Environment (VLE) for seven selected courses (called modules). Presentations of courses start in February and October; they are marked by "B" and "J", respectively. The dataset consists of tables connected by unique identifiers. All tables are stored in CSV format.
courses.csv This file contains the list of all available modules and their presentations. The columns are:
- code_module – code name of the module, which serves as the identifier.
- code_presentation – code name of the presentation. It consists of the year and "B" for the presentation starting in February or "J" for the presentation starting in October.
- length – length of the module-presentation in days.
The structure of B and J presentations may differ, and it is therefore good practice to analyse them separately. Nevertheless, for some presentations the corresponding previous B/J presentation does not exist, and the J presentation must then be used to inform the B presentation or vice versa. In the dataset this is the case for the CCC, EEE and GGG modules.
assessments.csv This file contains information about assessments in module-presentations. Usually, every presentation has a number of assessments followed by the final exam. The CSV contains the following columns:
vle.csv This file contains information about the available materials in the VLE. Typically these are HTML pages, PDF files, etc. Students have access to these materials online, and their interactions with them are recorded. The vle.csv file contains the following columns:
studentInfo.csv This file contains demographic information about the students together with their results. The file contains the following columns:
studentRegistration.csv This file contains information about the time when the student registered for the module presentation. For students who unregistered, the date of unregistration is also recorded. The file contains five columns:
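As a minimal sketch of how these tables link (the rows below are invented; only the column names given in the description are assumed), the shared identifiers code_module and code_presentation join courses.csv to the student-level tables:

```python
import pandas as pd

# Toy stand-ins for courses.csv and studentInfo.csv with illustrative values.
courses = pd.DataFrame({
    "code_module": ["AAA", "BBB"],
    "code_presentation": ["2013J", "2014B"],
    "length": [268, 240],
})
student_info = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2014B"],
    "id_student": [11391, 28400, 30268],
})

# Attach the presentation length to every student record via the shared keys.
merged = student_info.merge(courses, on=["code_module", "code_presentation"], how="left")
print(merged.shape)  # (3, 4)
```

The same key pair links assessments.csv, vle.csv, and studentRegistration.csv to courses.csv.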
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work investigates the trade-off between data anonymization and utility, particularly focusing on the implications for equity-related research in education. Using microdata from the 2019 Brazilian National Student Performance Exam (ENADE), the study applies the (ε, δ)-Differential Privacy model to explore the impact of anonymization on the dataset’s utility for socio-educational equity analysis. By clustering both the original and anonymized datasets, the research evaluates how group categories related to students’ sociodemographic variables, such as gender, race, income, and parental education, are affected by the anonymization process. The results reveal that while anonymization techniques can preserve overall data structure, they can also lead to the suppression or misrepresentation of minority groups, introducing biases that may jeopardise the promotion of educational equity. This finding highlights the importance of involving domain experts in the interpretation of anonymized data, particularly in studies aimed at reducing socio-economic inequalities. The study concludes that careful attention is needed to prevent anonymization efforts from distorting key group categories, which could undermine the validity of data-driven policies aimed at promoting equity.
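A minimal sketch of how an (ε, δ)-differentially private release of group counts might look (the classical Gaussian mechanism, valid for ε < 1; the group names and counts are illustrative, not drawn from ENADE):

```python
import math
import numpy as np

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism for (epsilon, delta)-DP."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

def noisy_counts(counts, epsilon, delta, rng):
    # Counting queries have sensitivity 1: one person changes a count by at most 1.
    sigma = gaussian_sigma(sensitivity=1.0, epsilon=epsilon, delta=delta)
    return {group: c + rng.normal(0.0, sigma) for group, c in counts.items()}

rng = np.random.default_rng(0)
released = noisy_counts({"group_a": 1200, "group_b": 35}, epsilon=0.5, delta=1e-5, rng=rng)
# The same absolute noise hits both groups, so the small minority group
# is distorted far more in relative terms - the effect the study highlights.
```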
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Title: Survey of International Students
Creator: Fatema Islam Meem, Imran Hussain Mahdy
Subject: International Students; Academic and Social Integration; University of Idaho
Description: This dataset contains responses from a survey conducted to understand the factors affecting the academic success and social integration of international students at the University of Idaho.
Contributor: University of Idaho
Date: February 10, 2024
Type: Dataset
Format: CSV
Source: Google Form
Language: English
Coverage: University of Idaho, [2014-2024]
Sources: The primary data source was a Google Form survey designed to capture international students' perspectives on their integration into the academic and social fabric of the university. Questions were developed to explore academic challenges, social integration, support systems, and overall satisfaction with their university experience.
Collection Methodology: The survey was distributed with the assistance of the International Programs Office (IPO) at the University of Idaho to ensure a broad reach among the target demographic. Efforts were made to design the survey questions to be clear, concise, and sensitive to the cultural diversity of the respondents. The collection process faced challenges, particularly in achieving a high response rate. Despite these obstacles, 48 responses were obtained, providing valuable insights into the experiences of international students. The survey data were anonymized to protect respondents' privacy and maintain data integrity.
This unique dataset, collected via a May 2025 survey, captures how 496 Indian college students use AI tools (e.g., ChatGPT, Gemini, Copilot) in academics. It includes 16 attributes, such as AI tool usage, trust, impact on grades, and internet access, making it well suited to education analytics and machine learning.
According to our latest research, the global market size for Anonymization Tools for Traffic Data reached USD 1.12 billion in 2024, reflecting robust adoption across various sectors. The market is projected to expand at a CAGR of 14.6% during the forecast period, reaching a value of USD 3.48 billion by 2033. This impressive growth is primarily driven by the increasing need for privacy-compliant data sharing and analysis in smart mobility and urban infrastructure, as well as stringent data protection regulations worldwide.
The surge in demand for Anonymization Tools for Traffic Data is fundamentally fueled by the exponential growth in data generated by intelligent transportation systems, connected vehicles, and urban mobility solutions. As cities embrace smart technologies to enhance traffic flow, reduce congestion, and improve public safety, the volume of sensitive traffic data collected from various sources such as sensors, cameras, and mobile devices has soared. However, this data often contains personally identifiable information (PII), raising significant privacy concerns. The implementation of robust anonymization tools has become a necessity for organizations to comply with regulations like GDPR, CCPA, and other regional data protection laws. These tools ensure that sensitive information is effectively masked or de-identified, enabling data-driven insights without compromising individual privacy, which in turn fuels market growth.
Another critical growth factor is the increasing collaboration between public and private entities to foster innovation in mobility analytics and urban planning. Governments, transportation authorities, and research organizations are leveraging anonymized traffic data to develop predictive models, optimize public transit routes, and design safer urban environments. The ability to securely share and analyze large volumes of traffic data without exposing personal information is central to these initiatives. Furthermore, advancements in artificial intelligence and machine learning have enhanced the capabilities of anonymization tools, allowing for more sophisticated data transformation techniques that maintain data utility while ensuring compliance. This technological evolution is propelling the adoption of anonymization solutions across diverse end-user segments.
The proliferation of smart city projects and the integration of Internet of Things (IoT) devices in transportation infrastructure are also significant drivers for the Anonymization Tools for Traffic Data Market. As urban centers worldwide invest in real-time traffic monitoring, autonomous vehicles, and multimodal mobility platforms, the complexity and sensitivity of traffic data continue to increase. Anonymization tools have become indispensable in enabling secure data exchange among stakeholders, facilitating cross-sector collaboration, and supporting data monetization strategies. Additionally, growing public awareness around digital privacy and the reputational risks associated with data breaches are prompting organizations to prioritize data anonymization as a core component of their digital strategy.
The advent of the Vehicle Data Anonymization Platform is revolutionizing how sensitive vehicle information is managed and utilized in the transportation sector. As connected vehicles become more prevalent, the data they generate is invaluable for enhancing traffic management, improving safety, and optimizing vehicle performance. However, this data often includes personal information that must be protected to comply with privacy regulations. A Vehicle Data Anonymization Platform provides a robust solution by ensuring that data is anonymized before it is shared or analyzed, thus preserving privacy while still allowing for valuable insights to be derived. This platform is crucial for enabling secure data exchange between automotive manufacturers, service providers, and urban planners, fostering innovation and collaboration across the mobility ecosystem.
From a regional perspective, North America currently leads the Anonymization Tools for Traffic Data Market, accounting for the largest share in 2024. This dominance is attributed to early adoption of advanced traffic management systems, a mature regulatory landscape, and significant investments in smart
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes anonymized records of student activities in a peer review activity.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Wi-Fi Probe Data Anonymization Services market size reached USD 1.43 billion in 2024, reflecting robust demand driven by heightened privacy regulations and the exponential growth of smart infrastructure. The market is expected to expand at a CAGR of 17.2% from 2025 to 2033, with the forecasted market size reaching USD 5.17 billion by 2033. This growth is primarily fueled by increasing concerns regarding data privacy, the proliferation of Wi-Fi-enabled devices, and the necessity for organizations to comply with global data protection standards.
One of the most significant growth factors for the Wi-Fi Probe Data Anonymization Services market is the intensifying regulatory landscape surrounding data privacy. With the enforcement of stringent data protection frameworks such as the General Data Protection Regulation (GDPR) in Europe, the California Consumer Privacy Act (CCPA), and similar regulations emerging in Asia and Latin America, organizations are under immense pressure to ensure that data collected from Wi-Fi probes is anonymized and compliant. These regulations mandate that personally identifiable information (PII) is protected, creating a strong incentive for enterprises to invest in advanced anonymization services. Additionally, the increasing frequency and sophistication of cyber threats have underscored the importance of robust data anonymization protocols, further propelling market demand.
Another critical driver is the rapid expansion of smart city initiatives and the widespread adoption of IoT devices. Urban centers across the globe are leveraging Wi-Fi probe technology to collect valuable data on foot traffic, mobility patterns, and public safety. However, the collection of such granular data raises significant privacy concerns, necessitating the deployment of effective anonymization solutions. The ability of Wi-Fi probe data anonymization services to mask or encrypt sensitive information without compromising the utility of the data is a key value proposition, enabling municipalities and private enterprises to harness actionable insights while maintaining compliance with privacy laws. This trend is expected to accelerate as cities continue to digitize their infrastructure and prioritize citizen privacy.
The growing integration of Wi-Fi analytics in sectors such as retail, hospitality, and transportation is also contributing to market expansion. Retailers, for instance, use Wi-Fi probe data to analyze customer behavior, optimize store layouts, and enhance personalized marketing. Similarly, transportation hubs rely on this data to manage passenger flow and improve operational efficiency. However, as consumers become increasingly aware of how their data is used, there is mounting pressure on businesses to employ anonymization services that safeguard user identities. The competitive advantage gained by maintaining consumer trust and adhering to best practices in data privacy is becoming a decisive factor for organizations across these verticals.
Regionally, North America remains at the forefront of the Wi-Fi Probe Data Anonymization Services market, accounting for the largest revenue share in 2024, followed closely by Europe and the Asia Pacific. The dominance of North America can be attributed to a combination of advanced technological infrastructure, early adoption of privacy-centric solutions, and a mature regulatory environment. Meanwhile, the Asia Pacific region is anticipated to witness the fastest growth over the forecast period, driven by rapid urbanization, expanding digital economies, and increasing investments in smart city projects. Latin America and the Middle East & Africa are also expected to demonstrate steady growth as regulatory frameworks strengthen and digital transformation initiatives gain momentum.
The service type segment of the Wi-Fi Probe Data Anonymization Services market encompasses data masking, data tokenization, data encryption, and other specialized anonymization techniques. Data masking remains a foundational approach, enabling organizations to obfuscate sensitive information such as MAC addresses and device identifiers while preserving the analytical value of the data. This technique is widely adopted in industries where real-time analytics are crucial, such as retail and transportation. The demand for data masking is expec
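As a hedged sketch of one common masking approach for device identifiers (keyed pseudonymization; the key handling and function names are illustrative, not a specific vendor's implementation):

```python
import hashlib
import hmac

# Illustrative secret; real deployments manage and rotate keys securely.
SECRET_KEY = b"rotate-me-regularly"

def mask_mac(mac: str) -> str:
    """Keyed pseudonymization of a MAC address: stable for analytics,
    not trivially reversible without the key. Truncation reduces linkability."""
    digest = hmac.new(SECRET_KEY, mac.lower().encode(), hashlib.sha256).hexdigest()
    return digest[:16]

a = mask_mac("AA:BB:CC:DD:EE:01")
b = mask_mac("aa:bb:cc:dd:ee:01")
print(a == b)  # normalization makes the same device map to the same pseudonym
```

A plain unkeyed hash would be vulnerable to enumeration of the MAC address space, which is why the keyed variant is sketched here.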
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains anonymized student data from a research study exploring the effects of different educational interventions on students in STEM (Science, Technology, Engineering, and Mathematics) courses across various universities. The study involved different conditions: an AI-tool-only intervention, a mindset-training-only intervention, a combined AI and mindset intervention, and a control group.
The data encompasses a rich set of variables, including:
- Demographics: Age, gender, ethnicity.
- Academic Background: University, course type (e.g., Computer Science, Mathematics, Biology, Physics, Engineering), and ACT scores.
- Psychological Measures: Baseline, mid-intervention, and final measures of intrinsic motivation, STEM anxiety, classroom comfort, and growth beliefs.
- Intervention-Specific Perceptions: For relevant groups, perceptions of AI usefulness and ease of use.
- Perceived Support: Perceptions of instructor universal beliefs, peer support, and faculty support.
- Academic Outcomes: Course grade, final exam score, assignment average, class attendance rate, and study hours per week.

This dataset is valuable for researchers and data scientists interested in educational psychology, learning analytics, the impact of AI in education, mindset interventions, and factors influencing student success in STEM fields. It allows for the exploration of how different interventions might affect student motivation, anxiety, beliefs, and academic performance over time.
Comprehensive High School Student Performance Dataset (2,000 Records)
This rich dataset contains 2,000 anonymized student records designed for in-depth analysis of academic success factors. It includes essential demographic information and performance scores across 13 core high school subjects.
Key Data Points:
Demographics: Includes Gender, Date of Birth, Grade, and Class.
Academic Performance: Detailed scores (out of 20 points each) for subjects ranging from Algebra II to Pre-Calculus and Computer Science.
Outcome Metrics: Includes Total Score and the final Result (Pass/Fail).
**Goal/Inspiration:** The dataset is intended to be a strong benchmark for machine learning projects focused on:
Predicting a student's final Result based on their demographic profile and initial subject scores.
Identifying which subjects are the strongest predictors of overall academic success.
Analyzing performance trends across different Grades and Classes.
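A minimal sketch of deriving the Result outcome from the subject scores (the pass threshold below is an assumption for illustration; the dataset description does not state one):

```python
# Per the description: 13 core subjects, each scored out of 20 points,
# with a Total Score and a final Pass/Fail Result.
SUBJECT_COUNT = 13
MAX_PER_SUBJECT = 20
PASS_RATIO = 0.5  # assumed threshold, not taken from the dataset

def result_from_scores(scores):
    assert len(scores) == SUBJECT_COUNT
    total = sum(scores)
    return "Pass" if total >= PASS_RATIO * SUBJECT_COUNT * MAX_PER_SUBJECT else "Fail"

print(result_from_scores([14] * 13))  # "Pass": total 182 of 260
```

A classifier trained on the real records would learn the actual decision boundary instead of this assumed rule.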
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains anonymized written reflections from 42 students enrolled in a course on academic integrity at Stockholm University. The data were originally collected on 6 September 2024 during a seminar on ethical aspects of using generative AI in academic work. Students were asked to respond to the question: “Should I mention that I used ChatGPT to complete an academic assignment?”
The reflections, totaling 2,626 words, were first written in Swedish and subsequently translated into English for research purposes. The dataset includes:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study reports on in-depth research into student learning using a "thinking-first" framework combined with stepwise heuristics to provide students with structure throughout the entire programming process.
The study targeted secondary education students in an elective computer science course. There was one class of 11 Dutch high school students, 2 female and 9 male. The group was heterogeneous, with students from different academic levels and age groups. Each student's level and previous experience with CS was determined a priori using a pretest.
For this study we developed sets of quizzes, tasks, and tests comprising code comprehension and code composition questions (including reading and creating flowchart designs). The student responses to each were analyzed.
This repository contains the following data:
- taxonomyPerQ.pdf: indicates the taxonomy level of each (quiz, task, test) question answered by students
- assessments_unanswered: all questions (quizzes, tasks, tests) administered to students
- pretask (responses): anonymized student responses to pretask questions
- midtask (responses): anonymized student responses to midtask questions
- finaltask (responses): anonymized student responses to finaltask questions
- quiz 1 (responses): anonymized student responses to quiz 1 questions
- quiz 2 (responses): anonymized student responses to quiz 2 questions
- final test (responses): anonymized student responses to final test questions
The students' handwritten work was scanned, saved as PDF, and coded in ATLAS.ti. These coded PDFs can no longer be anonymized and are therefore not openly distributed or published.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes all anonymized quantitative student-level demographic data, and a codebook explaining the variables contained in the dataset.
https://creativecommons.org/publicdomain/zero/1.0/
The University Academic Misconduct Detector Dataset is designed to support research and development of AI-powered tools for identifying potential academic dishonesty in higher education institutions. With the rise of online and hybrid learning environments, universities face growing challenges in detecting plagiarism, contract cheating, and other forms of misconduct. This dataset provides a clean, well-structured collection of academic records, behavior logs, and submission patterns to enable data-driven solutions for academic integrity.
The dataset aims to help:
Researchers develop advanced machine learning models for detecting anomalies in student work.
Educators identify early warning signs of potential misconduct and provide timely interventions.
Universities create automated monitoring systems to flag suspicious activity in large cohorts.
Data Scientists explore classification, clustering, and anomaly detection techniques in an educational setting.
The dataset contains the following key components:
Student Metadata: Basic anonymized demographic details (e.g., age group, department, study year).
Assignment & Exam Submissions: Scores, submission times, and similarity index from plagiarism detection tools.
Behavioral Indicators: Login frequency, resource usage, and last-minute submission patterns.
Misconduct Labels: Ground truth annotations indicating whether a case was flagged as misconduct or not.
All data has been anonymized to ensure privacy and compliance with ethical guidelines.
Binary Classification: Predict whether a student’s submission involves misconduct.
Anomaly Detection: Spot unusual submission patterns compared to the class average.
Feature Engineering Practice: Extract meaningful features from behavioral logs for predictive modeling.
Explainable AI Research: Explore interpretable models for sensitive decision-making in education.
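The anomaly-detection task above can be sketched as follows (the lead-time feature and threshold are assumptions for illustration, not fields guaranteed by the dataset): flag submissions whose time before the deadline deviates strongly from the class average.

```python
import statistics

def flag_anomalies(lead_times_hours, threshold=2.0):
    """Flag submissions whose lead time is more than `threshold` sample
    standard deviations away from the class mean (a simple z-score rule)."""
    mean = statistics.mean(lead_times_hours)
    stdev = statistics.stdev(lead_times_hours)
    return [abs(t - mean) / stdev > threshold for t in lead_times_hours]

# Illustrative class: ten typical submissions and one last-minute outlier.
lead_times = [30, 28, 31, 29, 30, 32, 27, 30, 29, 31, 0.2]
flags = flag_anomalies(lead_times)
```

In practice, robust statistics (e.g., median and MAD) or multivariate methods over several behavioral indicators would be less sensitive to the outliers they are meant to detect.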
Academic misconduct is a growing global concern that threatens the credibility of higher education. By providing this dataset, the goal is to foster transparent, ethical, and AI-assisted solutions that help maintain fairness and academic excellence.
Data is synthetic/anonymized and does not represent real individuals.
Intended for educational and research purposes only.
Any deployment of models trained on this dataset should follow strict institutional ethics policies.
One of the key impacts of AMI technology is the availability of interval energy usage data, which can support the development of new products and services and enable the market to deliver greater value to customers. Requestors can now access anonymized interval energy usage data in 30-minute intervals for all zip codes where AMI meters have been deployed.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The early identification of students facing learning difficulties is one of the most critical challenges in modern education. Intervening effectively requires leveraging data to understand the complex interplay between student demographics, engagement patterns, and academic performance.
This dataset was created to serve as a high-quality, pre-processed resource for building machine learning models to tackle this very problem. It is a unique hybrid dataset, meticulously crafted by unifying three distinct sources:
The Open University Learning Analytics Dataset (OULAD): A rich dataset detailing student interactions with a Virtual Learning Environment (VLE). We have aggregated the raw, granular data (over 10 million interaction logs) into powerful features, such as total clicks, average assessment scores, and distinct days of activity for each student registration.
The UCI Student Performance Dataset: A classic educational dataset containing demographic information and final grades in Portuguese and Math subjects from two Portuguese schools.
A Synthetic Data Component: A synthetically generated portion of the data, created to balance the dataset or represent specific student profiles.
A direct merge of these sources was not possible as the student identifiers were not shared. Instead, a strategy of intelligent concatenation was employed. The final dataset has undergone a rigorous pre-processing pipeline to make it immediately usable for machine learning tasks:
Advanced Imputation: Missing values were handled using a sophisticated iterative imputation method powered by Gaussian Mixture Models (GMM), ensuring the dataset's integrity.
One-Hot Encoding: All categorical features have been converted to a numerical format.
Feature Scaling: All numerical features have been standardized (using StandardScaler) to have a mean of 0 and a standard deviation of 1, preventing model bias from features with different scales.
The result is a clean, comprehensive dataset ready for modeling.
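The pipeline described above can be sketched as follows (a simplified stand-in: mean imputation replaces the GMM-based iterative imputation, and the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged OULAD/UCI/synthetic data.
df = pd.DataFrame({
    "total_clicks": [120.0, np.nan, 340.0, 80.0],
    "avg_score": [55.0, 70.0, np.nan, 62.0],
    "gender": ["M", "F", "F", "M"],
})

# 1. Imputation (mean imputation as a simple stand-in for GMM-based iterative imputation).
numeric = df.select_dtypes("number")
numeric = numeric.fillna(numeric.mean())

# 2. One-hot encoding of categorical features.
encoded = pd.get_dummies(df[["gender"]], prefix="gender")

# 3. Standardization to zero mean and unit standard deviation.
scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

processed = pd.concat([scaled, encoded], axis=1)
```

With scikit-learn available, the same steps map onto a Pipeline combining an imputer, OneHotEncoder, and StandardScaler.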
Each row represents a student profile, and the columns are the features and the target.
Features include aggregated online engagement metrics (e.g., clicks, distinct activities), academic performance (grades, scores), and student demographics (e.g., gender, age band). A key feature indicates the original data source (OULAD, UCI, Synthetic).
The dataset contains no Personally Identifiable Information (PII). Demographic information is presented in broad, anonymized categories.
Key Columns:
Target Variable:
had_difficulty: The primary target for classification. This binary variable has been engineered from the original final_result column of the OULAD dataset.
1: The student either failed (Fail) or withdrew (Withdrawn) from the course.
0: The student passed (Pass or Distinction).
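The target engineering above can be sketched directly from the final_result values:

```python
import pandas as pd

# had_difficulty = 1 for Fail or Withdrawn, 0 for Pass or Distinction.
final_result = pd.Series(["Pass", "Withdrawn", "Distinction", "Fail"])
had_difficulty = final_result.isin(["Fail", "Withdrawn"]).astype(int)
print(had_difficulty.tolist())  # [0, 1, 0, 1]
```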
Feature Groups:
OULAD Aggregated Features (e.g., oulad_total_cliques, oulad_media_notas): Quantitative metrics summarizing a student's engagement and performance within the VLE.
Academic Performance Features (e.g., nota_matematica_harmonizada): Harmonized grades from different data sources.
Demographic Features (e.g., gender_*, age_band_*): One-hot encoded columns representing student demographics.
Origin Features (e.g., origem_dado_OULAD, origem_dado_UCI): One-hot encoded columns indicating the original source of the data for each row. This allows for source-specific analysis.
(Note: All numerical feature names are post-scaling and may not directly reflect their original names. Please refer to the complete column list for details.)
This dataset would not be possible without the original data providers. Please acknowledge them in any work that uses this data:
OULAD Dataset: Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4. https://analyse.kmi.open.ac.uk/open_dataset
UCI Student Performance Dataset: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS. https://archive.ics.uci.edu/ml/datasets/student+performance
This dataset is perfect for a variety of predictive modeling tasks. Here are a few ideas to get you started:
Can you build a classification model to predict had_difficulty with high recall? (Minimizing the number of at-risk students we fail to identify).
Which features are the most powerful predictors of student failure or withdrawal? (Feature Importance Analysis).
Can you build separate models for each data origin (origem_dado_*) and compare ...
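For the first task, a minimal scikit-learn sketch is shown below. The synthetic features stand in for the dataset's scaled numerical columns; everything here (model choice, threshold value) is an assumption, not a prescribed baseline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled feature matrix; class 1 = had_difficulty.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Lowering the decision threshold below the default 0.5 trades precision
# for recall, i.e. fewer at-risk students are missed.
probs = clf.predict_proba(X_te)[:, 1]
recall_default = recall_score(y_te, (probs >= 0.5).astype(int))
recall_low_thr = recall_score(y_te, (probs >= 0.3).astype(int))
print(recall_default, recall_low_thr)
```

Threshold tuning is only one lever; class weighting (`class_weight="balanced"`) or resampling are common alternatives when recall on the at-risk class is the priority.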
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset card for Text Anonymization Benchmark (TAB) train
Dataset Summary
This is the training split of the Text Anonymisation Benchmark (TAB). As the title suggests, it is a dataset focused on text anonymisation, specifically European Court documents, which carry labels from multiple annotators.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed] See the full description on the dataset page: https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-train.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
A cross-sectional, mixed-methods online survey was deployed to students at five universities in the NE. The survey explored whether students changed their behaviour at university, what they changed, and why. Their engagement in physical activity, smoking, diet (consumption of fruit and vegetables) and alcohol consumption was assessed.
Custom license (https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/MXM0Q2)
In the publication [1] we implemented anonymization and synthetization techniques for a structured data set collected during the HiGHmed Use Case Cardiology study [2]. We employed the data anonymization tool ARX [3] and the data synthetization framework ASyH [4], individually and in combination, and evaluated the utility and shortcomings of the different approaches through statistical analyses and privacy risk assessments. Data utility was assessed by computing two heart failure risk scores (Barcelona BioHF [5] and MAGGIC [6]) on the protected data sets; we observed only minimal deviations from the scores obtained on the original data set. Additionally, we performed a re-identification risk analysis and found only minor residual risks for common types of privacy threats. We could thus demonstrate that anonymization and synthetization methods protect privacy while retaining data utility for heart failure risk assessment. Both approaches, and a combination thereof, introduce only minimal deviations from the original data set across all features. While data synthesis techniques can produce any number of new records, data anonymization techniques offer more formal privacy guarantees. Consequently, data synthesis on anonymized data further enhances privacy protection with little impact on data utility. We hereby share all generated data sets with the scientific community through a use and access agreement.
[1] Johann TI, Otte K, Prasser F, Dieterich C. Anonymize or synthesize? Privacy-preserving methods for heart failure score analytics. Eur Heart J 2024. doi:10.1093/ehjdh/ztae083
[2] Sommer KK, Amr A, Bavendiek, Beierle F, Brunecker P, Dathe H, et al. Structured, harmonized, and interoperable integration of clinical routine data to compute heart failure risk scores. Life (Basel) 2022;12:749.
[3] Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX—current status and challenges ahead. Softw Pract Exper 2020;50:1277–1304.
[4] Johann TI, Wilhelmi H. ASyH—anonymous synthesizer for health data. GitHub, 2023. Available at: https://github.com/dieterich-lab/ASyH.
[5] Lupón J, de Antonio M, Vila J, Peñafiel J, Galán A, Zamora E, et al. Development of a novel heart failure risk tool: the Barcelona bio-heart failure risk calculator (BCN Bio-HF calculator). PLoS One 2014;9:e85466.
[6] Pocock SJ, Ariti CA, McMurray JJV, Maggioni A, Køber L, Squire IB, et al. Predicting survival in heart failure: a risk score based on 39 372 patients from 30 studies. Eur Heart J 2013;34:1404–1413.
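The formal guarantee referenced above can be made concrete with k-anonymity: every combination of quasi-identifiers must occur at least k times in the released data. The check below is a generic pandas illustration of that property (the function name and sample records are ours, and this is not the ARX implementation):

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return the smallest equivalence-class size over the quasi-identifiers."""
    return int(df.groupby(quasi_identifiers).size().min())

# Toy released table with two quasi-identifying attributes.
records = pd.DataFrame({
    "age_band": ["40-49", "40-49", "50-59", "50-59"],
    "sex":      ["F", "F", "M", "M"],
})
print(k_anonymity(records, ["age_band", "sex"]))  # 2
```

A data set satisfies k-anonymity for a given k when this value is at least k; synthesizing records on top of such a table, as described above, adds a further layer of protection.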