MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset card for Text Anonymization Benchmark (TAB) train
Dataset Summary
This is the training split of the Text Anonymisation Benchmark (TAB). As the title suggests, the dataset focuses on text anonymisation, specifically of European court documents, which carry labels from multiple annotators.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information… See the full description on the dataset page: https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-train.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset card for Text Anonymization Benchmark (TAB) Validation & Test
Dataset Summary
This dataset contains the validation and test splits of the Text Anonymisation Benchmark (TAB). As the title suggests, the dataset focuses on text anonymisation, specifically of European court documents, which carry labels from multiple annotators.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data… See the full description on the dataset page: https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-val-test.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
For the purpose of research on data intermediaries and data anonymisation, it is necessary to test these processes with realistic video data containing personal data. For this purpose, the TreuMoDa project, funded by the German Federal Ministry of Education and Research (BMBF), has created a dataset of different traffic scenes containing identifiable persons.
This video data was collected at the Autonomous Driving Test Area Baden-Württemberg. On the one hand, it should be possible to recognise people in traffic, including their line of sight. On the other hand, it should be usable for the demonstration and evaluation of anonymisation techniques.
The legal basis for the publication of this dataset is the consent given by the participants, as documented in the file Consent.pdf (all purposes), in accordance with Art. 6(1)(a) and Art. 9(2)(a) GDPR. Any further processing is subject to the GDPR.
We make this dataset available for non-commercial purposes such as teaching, research and scientific communication. Please note that this licence is limited by the provisions of the GDPR. Anyone downloading this data will become an independent controller of the data. This data has been collected with the consent of the identifiable individuals depicted.
Any consensual use must take into account the purposes mentioned in the uploaded consent forms and in the privacy terms and conditions provided to the participants (see Consent.pdf). All participants consented to all three purposes, and no consent was withdrawn at the time of publication. KIT is unable to provide you with contact details for any of the participants, as we have removed all links to personal data other than that contained in the published images.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete list of the data used in the WELLBASED Project after the anonymisation process. This dataset contains the anonymised quantitative data for survey data, indoor air quality indicators, health monitoring, sleep data and outdoor environmental quality indicators.
---------------------------------------------
The WELLBASED project (the “Project”) participates in the Horizon 2020 framework programme for open science and innovation under the topic “SC1-BHC-29-2020-Innovative actions for improving urban health and wellbeing - addressing environment, climate and socioeconomic factors”. This topic aims to address health inequalities, improved physical or mental health, relevant socio-economic and/or environmental determinants of health, and the need for more systematic data collection on urban health across the EU.
WELLBASED has designed, implemented and evaluated a comprehensive urban programme to reduce energy poverty and its effects on citizens' health and wellbeing, built on evidence-based approaches in six different pilot cities that represent different urban realities as well as a diverse range of welfare and healthcare models. The project’s multidisciplinary consortium, made up of 19 partners from 10 countries, was built to guarantee full coverage of the scientific, clinical, social and environmental competencies, and to gather the viewpoints of the different communities and actors needed to develop, test and evaluate the WELLBASED interventions, in order to maximise their chances of success.
What is WELLBASED?
Improving health, wellbeing and equality by evidence-based urban policies for tackling energy poverty
WHAT:
WELLBASED is a European project funded under the HORIZON 2020 programme of the European Commission. The diverse team will design, implement and evaluate a comprehensive urban programme to significantly reduce energy poverty and its effects on citizens' health and wellbeing, built on evidence-based approaches in six pilot cities that represent not only different urban realities but also a diverse range of welfare and healthcare models.
WHY:
Energy poverty is becoming a major challenge for European welfare systems and beyond, compounding the inequalities that stem from living conditions and social determinants, with a direct and negative impact on health and wellbeing, mainly in urban contexts. Health problems attributable to energy poverty include respiratory diseases, heart attacks, stroke and mental disorders (stress, anxiety, depression), but also acute health issues such as hypothermia, injuries or influenza. The complex nature of this recently identified phenomenon requires a comprehensive analysis of the problem and its solutions from a multidimensional perspective covering environmental, political, social, regulatory and psychological issues, and therefore other Social Determinants of Health and health inequalities. Urban policies and initiatives can respond very efficiently to energy poverty and its effects on citizens' wellbeing and health by providing evidence-based interventions that cover different angles of the challenge, combining individual (behavioural) actions with social-political ones (regulations, urban planning) that include health in all policies.
HOW:
Based on the socioecological model, we test intervention schemes addressing energy poverty in 6 different pilot cities in Europe: Valencia (Spain), Heerlen (The Netherlands), Leeds (UK), Edirne (Turkey), Obuda (Hungary) and Jelgava (Latvia). The city of Skopje (North Macedonia) joins the pilots as an observer.
WHEN:
The responses to the tests used in this project and the values of the sensors and meters will be collected between September 2022 and September 2024.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains anonymised outgoing phone-call records from a single small-business phone line used by Walsoft Software Works cc (trading as Walsoft Computers). The company is a registered South African close corporation (Enterprise number B2004037443) co-owned by Philip Tambiti Leo Walekhwa and Shoni Reineth Walekhwa. The data covers calls made between January 2022 and October 2024. Each row represents one call, capturing the date, time, weekday, duration, a pseudonymous contact name, and a semi-manual relationship category (Family, Supplier, Important Contacts, Service Provider, or Unknown).
The dataset is designed as a semi-supervised learning playground. The labelled calls define clear relationship types, while the large Unknown group is deliberately broad and mixed. Inside Unknown you will typically find a blend of prospective clients, infrequent suppliers, acquaintances, institutions that have not yet been tagged, and other one-off or rarely contacted numbers.
This makes the dataset useful for several kinds of projects:
Classification of Unknown calls
Use the labelled calls to train models that predict full categories (or a simpler business vs non-business flag) for Unknown calls, based on features such as hour of day, day of week, duration, weekday/weekend, working-hours flags, and per-contact history.
Clustering and segmentation inside Unknown
Discover sub-groups within Unknown (for example, prospective clients, infrequent suppliers, acquaintances, one-off institutions) using time-of-day, weekday, duration, and contact-level features, and interpret clusters behaviourally.
Contact-level behaviour and segmentation
Aggregate over the anonymised dialled_phone_number to analyse how different contacts behave (frequency, total time, working-hours vs off-hours usage) and segment them into types such as core suppliers, key family members, heavy evening callers, or one-off numbers.
Productivity and potential misuse analysis
Estimate how much of the business line’s working-hours call time goes to clearly business categories (Supplier, Important Contacts, Service Provider) versus Family and Unknown, and explore whether some Unknown clusters look more like social/personal usage than business usage.
Time-evolution and yearly trends
Use the year, month, and date_stamp columns to compare behaviour across 2022–2024, such as changes in business vs non-business usage, weekend patterns, or the size of the Unknown group over time.
Labelling pipeline and active learning
Treat Unknown as a pool for incremental labelling: design rules or models to suggest labels, identify the most “informative” Unknown calls for manual review, and track how the proportion of Unknown decreases as the dataset becomes better labelled.
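As one illustration of the project ideas above, the productivity analysis can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the dataset author's method: the keys weekday, hour, duration_seconds and relationship are hypothetical field names (the dataset's actual column names may differ), the 08:00 to 17:00 weekday window is an assumed definition of working hours, and the sample rows are invented for illustration.

```python
from collections import defaultdict

# Category names come from the dataset description; grouping these three
# as "business" follows the suggestion in the text.
BUSINESS = {"Supplier", "Important Contacts", "Service Provider"}


def working_hours_share(calls, start_hour=8, end_hour=17):
    """Return each group's fraction of total working-hours call time.

    `calls` is an iterable of dicts with hypothetical keys:
    weekday (0 = Monday), hour (0-23), duration_seconds, relationship.
    Working hours are assumed to be Monday-Friday, start_hour <= hour < end_hour.
    """
    totals = defaultdict(float)
    for call in calls:
        in_hours = call["weekday"] < 5 and start_hour <= call["hour"] < end_hour
        if not in_hours:
            continue  # skip evening and weekend calls
        relationship = call["relationship"]
        group = "business" if relationship in BUSINESS else relationship
        totals[group] += call["duration_seconds"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {group: seconds / grand_total for group, seconds in totals.items()}


# Toy rows, invented for illustration only (not taken from the dataset).
sample_calls = [
    {"weekday": 0, "hour": 9, "duration_seconds": 300, "relationship": "Supplier"},
    {"weekday": 0, "hour": 10, "duration_seconds": 100, "relationship": "Family"},
    {"weekday": 2, "hour": 14, "duration_seconds": 400, "relationship": "Unknown"},
    {"weekday": 2, "hour": 15, "duration_seconds": 200, "relationship": "Service Provider"},
    {"weekday": 5, "hour": 11, "duration_seconds": 900, "relationship": "Family"},   # weekend
    {"weekday": 1, "hour": 20, "duration_seconds": 600, "relationship": "Unknown"},  # evening
]

shares = working_hours_share(sample_calls)
for group, share in sorted(shares.items()):
    print(f"{group}: {share:.0%}")
```

Keeping Family and Unknown as separate groups, rather than merging them into a single non-business bucket, makes it easier to see how much of the working-hours time is still unlabelled.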
All phone numbers are anonymised to numeric identifiers (without preserving the original South African dialing structure), and names are pseudonyms that describe the relationship or role rather than revealing real identities. Compliance with South Africa’s POPIA (Protection of Personal Information Act) and European GDPR principles has been carefully considered: data collection occurred in South Africa, and the subsequent study and dataset preparation took place in Norway, with personal identifiers removed and only call metadata retained.
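The description does not say how the numeric identifiers were produced. The following is only a hedged sketch of one common way to replace phone numbers with identifiers that carry no trace of the original dialling structure: each distinct number gets an arbitrary integer assigned in shuffled order. The function name pseudonymise_numbers and the sample numbers are hypothetical, and the mapping would have to be discarded or kept secret for the result to stay non-reversible.

```python
import random


def pseudonymise_numbers(numbers, seed=None):
    """Map each distinct phone number to an arbitrary integer identifier.

    Returns (ids, mapping). Identifiers are assigned in shuffled order, so
    they reveal neither the digits nor the ordering of the original numbers.
    A fixed seed is for reproducible demos only; leave it as None in practice.
    """
    distinct = sorted(set(numbers))
    rng = random.Random(seed)
    rng.shuffle(distinct)
    mapping = {number: index for index, number in enumerate(distinct, start=1)}
    ids = [mapping[number] for number in numbers]
    return ids, mapping


# Invented example numbers, not taken from the dataset.
raw_numbers = ["0115550101", "0825550199", "0115550101", "0215550142"]
ids, mapping = pseudonymise_numbers(raw_numbers, seed=7)
print(ids)
```

Repeated calls to the same number keep the same identifier, which is what makes per-contact aggregation on the anonymised column possible.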
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
As part of their #DataSession on 27 and 28 May 2016, the financial courts decided to publish the full text of their case law as open data. The format chosen is XML (or even HTML directly), which will allow the inclusion of metadata useful to re-users. Initially, the decisions of the regional and territorial chambers will be available in Word format, but we are working on converting them to XML or HTML.
These datasets concern the judgments of the regional and territorial chambers for 2015.
Please report any anonymization errors that you may detect, so that the dataset can be corrected.