6 datasets found

text-anonymization-benchmark-train
huggingface.co
Updated Feb 11, 2025
Cite
Mateusz Dziemian (2025). text-anonymization-benchmark-train [Dataset]. https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-train
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Feb 11, 2025
Authors
Mateusz Dziemian
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset card for Text Anonymization Benchmark (TAB) train

Dataset Summary

This is the training split of the Text Anonymisation Benchmark. As the title says, it is a dataset focused on text anonymisation, specifically of European court documents, which contain labels from multiple annotators.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure Data Instances

[More Information… See the full description on the dataset page: https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-train.
text-anonymization-benchmark-val-test
huggingface.co
Updated Mar 22, 2024
Cite
Mateusz Dziemian (2024). text-anonymization-benchmark-val-test [Dataset]. https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-val-test
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Mar 22, 2024
Authors
Mateusz Dziemian
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset card for Text Anonymization Benchmark (TAB) Validation & Test

Dataset Summary

This is the validation and test split of the Text Anonymisation Benchmark. As the title says, it is a dataset focused on text anonymisation, specifically of European court documents, which contain labels from multiple annotators.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure Data… See the full description on the dataset page: https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-val-test.
Consensual videos of potentially re-identifiable individuals recorded at the...
zenodo.org
data.niaid.nih.gov
pdf, zip
Updated Jul 11, 2024
+ more versions
Cite
Vivien Geenen; Till Riedel (2024). Consensual videos of potentially re-identifiable individuals recorded at the Autonomous Driving Test Area Baden-Württemberg (raw images recorded daytime). [Dataset]. http://doi.org/10.5281/zenodo.10020644
Explore at:
zip, pdf (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.10020644
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Vivien Geenen; Till Riedel
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Time period covered
Mar 30, 2023
Area covered
Baden-Württemberg
Description
For the purpose of research on data intermediaries and data anonymisation, it is necessary to test these processes with realistic video data containing personal data. For this purpose, the TreuMoDa project, funded by the German Federal Ministry of Education and Research (BMBF), has created a dataset of different traffic scenes containing identifiable persons.
This video data was collected at the Autonomous Driving Test Area Baden-Württemberg. On the one hand, it should be possible to recognise people in traffic, including their line of sight. On the other hand, it should be usable for the demonstration and evaluation of anonymisation techniques.
The legal basis for the publication of this dataset is the consent given by the participants, as documented in the file Consent.pdf (all purposes), in accordance with Art. 6(1)(a) and Art. 9(2)(a) GDPR. Any further processing is subject to the GDPR.
We make this dataset available for non-commercial purposes such as teaching, research and scientific communication. Please note that this licence is limited by the provisions of the GDPR. Anyone downloading this data will become an independent controller of the data. This data has been collected with the consent of the identifiable individuals depicted.
Any consensual use must take into account the purposes mentioned in the uploaded consent forms and in the privacy terms and conditions provided to the participants (see Consent.pdf). All participants consented to all three purposes, and no consent was withdrawn at the time of publication. KIT is unable to provide you with contact details for any of the participants, as we have removed all links to personal data other than that contained in the published images.
Data from: Anonymized Dataset from WELLBASED Project
zenodo.org
producciocientifica.uv.es
+1 more
txt
Updated Mar 31, 2025
Cite
Antonio Fernandez-Gimenez; Lucie Middlemiss; Hande BARLIN; Elena Petrova; Pilar Jorda Murciano; Noemí García Petit; Claudia Ferre; Marite Ievina; Guus Van der Nat; Barbara Somogyi; Josep Redon; Amy Van Grieken; Ricard Martínez; Alma Virto (2025). Anonymized Dataset from WELLBASED Project [Dataset]. http://doi.org/10.5281/zenodo.14918253
Explore at:
txt (available download formats)
Unique identifier
https://doi.org/10.5281/zenodo.14918253
Dataset updated
Mar 31, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Antonio Fernandez-Gimenez; Lucie Middlemiss; Hande BARLIN; Elena Petrova; Pilar Jorda Murciano; Noemí García Petit; Claudia Ferre; Marite Ievina; Guus Van der Nat; Barbara Somogyi; Josep Redon; Amy Van Grieken; Ricard Martínez; Alma Virto
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Time period covered
Sep 2022 - Sep 2024
Description
Complete list of the data used in the WELLBASED Project after the anonymisation process. This dataset contains the anonymised quantitative data for survey data, indoor air quality indicators, health monitoring, sleep data and outdoor environmental quality indicators.

---------------------------------------------

The WELLBASED project (the “Project”) participates in the Horizon 2020 framework programme for open science and innovation under the topic “SC1-BHC-29-2020-Innovative actions for improving urban health and wellbeing - addressing environment, climate and socioeconomic factors”. This topic aims to address health inequalities, improved physical or mental health, relevant socio-economic and/or environmental determinants of health, and the need for more systematic data collection on urban health across the EU.

WELLBASED has designed, implemented and evaluated a comprehensive urban programme to reduce energy poverty and its effects on citizens' health and wellbeing, built on evidence-based approaches in 6 different pilot cities that represent different urban realities as well as a diverse range of welfare and healthcare models. The project’s multidisciplinary consortium, made up of 19 partners from 10 countries, was built to guarantee full coverage of the scientific, clinical, social and environmental competencies, and to gather the viewpoints of the different communities and actors necessary to develop, test and evaluate the WELLBASED interventions in order to maximize their chances of success.

What is WELLBASED?
Improving health, wellbeing and equality by evidence-based urban policies for tackling energy poverty

WHAT:
WELLBASED is a European project funded under the HORIZON 2020 programme of the European Commission. The diverse team will design, implement and evaluate a comprehensive urban programme to significantly reduce energy poverty and its effects on citizens' health and wellbeing, built on evidence-based approaches in six pilot cities that represent not only different urban realities but also a diverse range of welfare and healthcare models.

WHY:
Energy poverty is becoming a major challenge for European welfare systems and beyond, deepening the inequalities derived from living conditions and social determinants, with a direct and negative impact on health and wellbeing, mainly in urban contexts. Health problems attributable to energy poverty include respiratory diseases, heart attacks, stroke and mental disorders (stress, anxiety, depression), but also acute health issues, such as hypothermia, injuries or influenza. The complex nature of this recently identified phenomenon requires a comprehensive analysis of the problem and its solution from a multidimensional approach, which should involve environmental, political, social, regulatory and psychological issues, and therefore other Social Determinants of Health and health inequalities. Urban policies and initiatives might respond very efficiently to energy poverty and its effects on citizens' wellbeing and health by providing evidence-based interventions that cover different angles of the challenge, including complementary individual (behavioural) and social-political actions (regulations, urban planning) that include health in all policies.

HOW:
Based on the socioecological model, we test intervention schemes addressing energy poverty in 6 different pilot cities in Europe: Valencia (Spain), Heerlen (The Netherlands), Leeds (UK), Edirne (Turkey), Obuda (Hungary) and Jelgava (Latvia). The city of Skopje (North Macedonia) joins the pilots as an observer.

WHEN:
The responses to the tests used in this project and the values of the sensors and meters will be collected between September 2022 and September 2024.
Walsoftai Semi-Categorized 1-User Call Behaviour
kaggle.com
zip
Updated Nov 15, 2025
Cite
Walekhwa Tambiti Leo Philip (2025). Walsoftai Semi-Categorized 1-User Call Behaviour [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/walsoftai-semi-categorized-1-user-call-behaviour
Explore at:
zip, 2926052 bytes (available download formats)
Dataset updated
Nov 15, 2025
Authors
Walekhwa Tambiti Leo Philip
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains anonymised outgoing phone-call records from a single small-business phone line used by Walsoft Software Works cc (trading as Walsoft Computers). The company is a registered South African close corporation (Enterprise number B2004037443) co-owned by Philip Tambiti Leo Walekhwa and Shoni Reineth Walekhwa. The data covers calls made between January 2022 and October 2024. Each row represents one call, capturing the date, time, weekday, duration, a pseudonymous contact name, and a semi-manual relationship category (Family, Supplier, Important Contacts, Service Provider, or Unknown).

The dataset is designed as a semi-supervised learning playground. The labelled calls define clear relationship types, while the large Unknown group is deliberately broad and mixed. Inside Unknown you will typically find a blend of prospective clients, infrequent suppliers, acquaintances, institutions that have not yet been tagged, and other one-off or rarely contacted numbers.

This makes the dataset useful for several kinds of projects:

Classification of Unknown calls
Use the labelled calls to train models that predict full categories (or a simpler business vs non-business flag) for Unknown calls, based on features such as hour of day, day of week, duration, weekday/weekend, working-hours flags, and per-contact history.

Clustering and segmentation inside Unknown
Discover sub-groups within Unknown (for example, prospective clients, infrequent suppliers, acquaintances, one-off institutions) using time-of-day, weekday, duration, and contact-level features, and interpret clusters behaviourally.

Contact-level behaviour and segmentation
Aggregate over the anonymised dialled_phone_number to analyse how different contacts behave (frequency, total time, working-hours vs off-hours usage) and segment them into types such as core suppliers, key family members, heavy evening callers, or one-off numbers.

Productivity and potential misuse analysis
Estimate how much of the business line’s working-hours call time goes to clearly business categories (Supplier, Important Contacts, Service Provider) versus Family and Unknown, and explore whether some Unknown clusters look more like social/personal usage than business usage.

Time-evolution and yearly trends
Use the year, month, and date_stamp columns to compare behaviour across 2022–2024, such as changes in business vs non-business usage, weekend patterns, or the size of the Unknown group over time.

Labelling pipeline and active learning
Treat Unknown as a pool for incremental labelling: design rules or models to suggest labels, identify the most “informative” Unknown calls for manual review, and track how the proportion of Unknown decreases as the dataset becomes better labelled.
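
Several of the project ideas above rely on the same derived features (hour of day, weekend flag, working-hours flag) and on a split of call time by category. A minimal sketch using only the Python standard library follows. The field names `timestamp`, `duration_s` and `category`, the sample rows, and the definition of working hours as weekdays from 08:00 to 17:00 are assumptions for illustration, not the dataset's documented schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows; the real column names and values may differ.
CALLS = [
    {"timestamp": "2023-03-06 09:15:00", "duration_s": 120, "category": "Supplier"},
    {"timestamp": "2023-03-06 19:40:00", "duration_s": 300, "category": "Family"},
    {"timestamp": "2023-03-11 10:05:00", "duration_s": 60, "category": "Unknown"},
    {"timestamp": "2023-03-07 14:30:00", "duration_s": 180, "category": "Unknown"},
]

# Categories the description treats as clearly business-related.
BUSINESS = {"Supplier", "Important Contacts", "Service Provider"}


def derive_features(call, work_start=8, work_end=17):
    """Return time-based features for one call record."""
    ts = datetime.strptime(call["timestamp"], "%Y-%m-%d %H:%M:%S")
    is_weekend = ts.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),
        "is_weekend": is_weekend,
        "working_hours": (not is_weekend) and work_start <= ts.hour < work_end,
        "duration_s": call["duration_s"],
    }


def working_hours_share(calls):
    """Share of working-hours call time per group (Business, Family, Unknown)."""
    totals = defaultdict(int)
    for call in calls:
        if not derive_features(call)["working_hours"]:
            continue
        group = "Business" if call["category"] in BUSINESS else call["category"]
        totals[group] += call["duration_s"]
    grand_total = sum(totals.values())
    return {g: t / grand_total for g, t in totals.items()} if grand_total else {}


features = [derive_features(c) for c in CALLS]
shares = working_hours_share(CALLS)
```

The same feature dictionaries could feed a classifier or a clustering step for the Unknown group; the working-hours boundaries are parameters precisely because the source does not state them.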

All phone numbers are anonymised to numeric identifiers (without preserving the original South African dialing structure), and names are pseudonyms that describe the relationship or role rather than revealing real identities. Compliance with South Africa’s POPIA (Protection of Personal Information Act) and European GDPR principles has been carefully considered: data collection occurred in South Africa, and the subsequent study and dataset preparation took place in Norway, with personal identifiers removed and only call metadata retained.
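
One way to obtain numeric identifiers that do not preserve the original dialling structure is to assign each distinct number an arbitrary ID in shuffled order. The sketch below illustrates that idea only; it is not the author's documented procedure, and the function name, the ID range and the sample numbers are hypothetical.

```python
import random


def pseudonymise_numbers(numbers, seed=None):
    """Map each distinct phone number to an arbitrary numeric ID.

    IDs are assigned in shuffled order, so they carry no information
    about the prefix, length, or ordering of the original numbers.
    """
    distinct = sorted(set(numbers))
    rng = random.Random(seed)
    rng.shuffle(distinct)
    # IDs start at 1000 purely for readability (an arbitrary choice).
    return {number: 1000 + i for i, number in enumerate(distinct)}


# Made-up numbers for illustration only.
raw = ["+27 11 555 0101", "+27 82 555 0199", "+27 11 555 0101"]
mapping = pseudonymise_numbers(raw, seed=42)
pseudonymised = [mapping[n] for n in raw]
```

Repeated calls to the same number keep the same identifier, which is what makes contact-level aggregation possible. Anyone holding the mapping can reverse it, so it would need to be discarded or stored separately from the published data.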
Anonymised judgments of the Regional and Territorial Audit Chambers (2015)
data.europa.eu
zip
Updated May 27, 2016
Cite
Cour des comptes (2016). Anonymised judgments of the Regional and Territorial Audit Chambers (2015) [Dataset]. https://data.europa.eu/data/datasets/5746f8ca88ee382b03d1b934/embed
Explore at:
zip (15264803), available download formats
Dataset updated
May 27, 2016
Dataset authored and provided by
Cour des comptes
License
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
Description
As part of its #DataSession on 27 and 28 May 2016, the financial courts decided to open the full text of their case law in open data. The format chosen is XML (or even HTML directly), which will allow the inclusion of metadata useful to re-users. Initially, the decisions of the regional and territorial chambers will be available in Word format but we are working on their conversion into XML or HTML.

These datasets concern the judgments of the regional and territorial chambers for 2015.

Please report any anonymization errors that you may detect, so that the dataset can be corrected.