6 datasets found
  1. text-anonymization-benchmark-train

    • huggingface.co
    Updated Feb 11, 2025
    Cite
    Mateusz Dziemian (2025). text-anonymization-benchmark-train [Dataset]. https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-train
    Dataset updated
    Feb 11, 2025
    Authors
    Mateusz Dziemian
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset card for Text Anonymization Benchmark (TAB) train

      Dataset Summary
    

    This is the training split of the Text Anonymisation Benchmark. As the title says, it is a dataset focused on text anonymisation, specifically of European court documents, which contain labels from multiple annotators.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    [More Information Needed]

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    [More Information… See the full description on the dataset page: https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-train.

  2. text-anonymization-benchmark-val-test

    • huggingface.co
    Updated Mar 22, 2024
    Cite
    Mateusz Dziemian (2024). text-anonymization-benchmark-val-test [Dataset]. https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-val-test
    Dataset updated
    Mar 22, 2024
    Authors
    Mateusz Dziemian
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset card for Text Anonymization Benchmark (TAB) Validation & Test

      Dataset Summary
    

    This is the validation and test split of the Text Anonymisation Benchmark. As the title says, it is a dataset focused on text anonymisation, specifically of European court documents, which contain labels from multiple annotators.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    [More Information Needed]

      Dataset Structure
    
    
    
    
    
      Data… See the full description on the dataset page: https://huggingface.co/datasets/mattmdjaga/text-anonymization-benchmark-val-test.
    
  3. Consensual videos of potentially re-identifiable individuals recorded at the...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 11, 2024
    + more versions
    Cite
    Vivien Geenen; Till Riedel (2024). Consensual videos of potentially re-identifiable individuals recorded at the Autonomous Driving Test Area Baden-Württemberg (raw images recorded daytime). [Dataset]. http://doi.org/10.5281/zenodo.10020644
    Available download formats: zip, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Vivien Geenen; Till Riedel
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    Mar 30, 2023
    Area covered
    Baden-Württemberg
    Description

    For the purpose of research on data intermediaries and data anonymisation, it is necessary to test these processes with realistic video data containing personal data. For this purpose, the TreuMoDa project, funded by the German Federal Ministry of Education and Research (BMBF), has created a dataset of different traffic scenes containing identifiable persons.

    This video data was collected at the Autonomous Driving Test Area Baden-Württemberg. On the one hand, it should be possible to recognise people in traffic, including their line of sight. On the other hand, it should be usable for the demonstration and evaluation of anonymisation techniques.

    The legal basis for the publication of this data set is the consent given by the participants, as documented in the file Consent.pdf (all purposes), in accordance with Art. 6(1)(a) and Art. 9(2)(a) GDPR. Any further processing is subject to the GDPR.

    We make this dataset available for non-commercial purposes such as teaching, research and scientific communication. Please note that this licence is limited by the provisions of the GDPR. Anyone downloading this data will become an independent controller of the data. This data has been collected with the consent of the identifiable individuals depicted.

    Any consensual use must take into account the purposes mentioned in the uploaded consent forms and in the privacy terms and conditions provided to the participants (see Consent.pdf). All participants consented to all three purposes, and no consent was withdrawn at the time of publication. KIT is unable to provide you with contact details for any of the participants, as we have removed all links to personal data other than that contained in the published images.

  4. Data from: Anonymized Dataset from WELLBASED Project

    • zenodo.org
    • producciocientifica.uv.es
    • +1more
    txt
    Updated Mar 31, 2025
    Cite
    Antonio Fernandez-Gimenez; Lucie Middlemiss; Hande BARLIN; Elena Petrova; Pilar Jorda Murciano; Noemí García Petit; Claudia Ferre; Marite Ievina; Guus Van der Nat; Barbara Somogyi; Josep Redon; Amy Van Grieken; Ricard Martínez; Alma Virto (2025). Anonymized Dataset from WELLBASED Project [Dataset]. http://doi.org/10.5281/zenodo.14918253
    Available download formats: txt
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Antonio Fernandez-Gimenez; Lucie Middlemiss; Hande BARLIN; Elena Petrova; Pilar Jorda Murciano; Noemí García Petit; Claudia Ferre; Marite Ievina; Guus Van der Nat; Barbara Somogyi; Josep Redon; Amy Van Grieken; Ricard Martínez; Alma Virto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 2022 - Sep 2024
    Description

    Complete list of the data used in the WELLBASED Project after the anonymisation process. This dataset contains the anonymised quantitative data for survey data, indoor air quality indicators, health monitoring, sleep data and outdoor environmental quality indicators.

    ---------------------------------------------

    The WELLBASED project (the “Project”) participates in the Horizon 2020 framework programme for open science and innovation under the topic “SC1-BHC-29-2020-Innovative actions for improving urban health and wellbeing - addressing environment, climate and socioeconomic factors”. This topic aims to address health inequalities, improved physical or mental health, relevant socio-economic and/or environmental determinants of health, and the need for more systematic data collection on urban health across the EU.

    WELLBASED has designed, implemented and evaluated a comprehensive urban programme to reduce energy poverty and its effects on citizens' health and wellbeing, built on evidence-based approaches in 6 different pilot cities, representing different urban realities but also a diverse range of welfare and healthcare models. The project’s multidisciplinary consortium, made up of 19 partners from 10 countries, has been built to guarantee full coverage of the scientific, clinical, social and environmental competencies, and to gather the viewpoints of the different communities and actors necessary to develop, test and evaluate the interventions related to WELLBASED in order to maximize its chances of success.

    What is WELLBASED?
    Improving health, wellbeing and equality by evidence-based urban policies for tackling energy poverty

    WHAT:
    WELLBASED is a European project funded under the HORIZON 2020 programme of the European Commission. The diverse team will design, implement and evaluate a comprehensive urban programme to significantly reduce energy poverty and its effects on citizens' health and wellbeing, built on evidence-based approaches in six pilot cities that represent not only different urban realities but also a diverse range of welfare and healthcare models.

    WHY:
    Energy poverty is becoming a major challenge for European welfare systems and beyond, adding to the inequalities derived from living conditions and social determinants, with a direct and negative impact on health and wellbeing, mainly in urban contexts. Health problems attributable to energy poverty include respiratory diseases, heart attacks, stroke and mental disorders (stress, anxiety, depression), but also acute health issues, such as hypothermia, injuries or influenza. The complex nature of this recently identified phenomenon requires a comprehensive analysis of the problem and its solution from a multidimensional approach, which should involve environmental, political, social, regulatory and psychological issues, thus involving other Social Determinants of Health and health inequalities. Urban policies and initiatives might respond very efficiently to energy poverty and its effects on citizens' wellbeing and health, by providing evidence-based interventions covering different angles of the challenge, including complementary actions at the individual (behavioural) level as well as social-political actions (regulations, urban planning) that include health in all policies.

    HOW:
    Based on the socioecological model, we test intervention schemes addressing energy poverty in 6 different pilot cities in Europe: Valencia (Spain), Heerlen (The Netherlands), Leeds (UK), Edirne (Turkey), Obuda (Hungary) and Jelgava (Latvia). The city of Skopje (North Macedonia) joins the pilots as an observer.

    WHEN:
    The responses to the tests used in this project and the values of the sensors and meters will be collected between September 2022 and September 2024.

  5. Walsoftai Semi-Categorized 1-User Call Behaviour

    • kaggle.com
    zip
    Updated Nov 15, 2025
    Cite
    Walekhwa Tambiti Leo Philip (2025). Walsoftai Semi-Categorized 1-User Call Behaviour [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/walsoftai-semi-categorized-1-user-call-behaviour
    Available download formats: zip (2926052 bytes)
    Dataset updated
    Nov 15, 2025
    Authors
    Walekhwa Tambiti Leo Philip
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains anonymised outgoing phone-call records from a single small-business phone line used by Walsoft Software Works cc (trading as Walsoft Computers). The company is a registered South African close corporation (Enterprise number B2004037443) co-owned by Philip Tambiti Leo Walekhwa and Shoni Reineth Walekhwa. The data covers calls made between January 2022 and October 2024. Each row represents one call, capturing the date, time, weekday, duration, a pseudonymous contact name, and a semi-manual relationship category (Family, Supplier, Important Contacts, Service Provider, or Unknown).

    The dataset is designed as a semi-supervised learning playground. The labelled calls define clear relationship types, while the large Unknown group is deliberately broad and mixed. Inside Unknown you will typically find a blend of prospective clients, infrequent suppliers, acquaintances, institutions that have not yet been tagged, and other one-off or rarely contacted numbers.

    This makes the dataset useful for several kinds of projects:

    • Classification of Unknown calls
      Use the labelled calls to train models that predict full categories (or a simpler business vs non-business flag) for Unknown calls, based on features such as hour of day, day of week, duration, weekday/weekend, working-hours flags, and per-contact history.

    • Clustering and segmentation inside Unknown
      Discover sub-groups within Unknown (for example, prospective clients, infrequent suppliers, acquaintances, one-off institutions) using time-of-day, weekday, duration, and contact-level features, and interpret clusters behaviourally.

    • Contact-level behaviour and segmentation
      Aggregate over the anonymised dialled_phone_number to analyse how different contacts behave (frequency, total time, working-hours vs off-hours usage) and segment them into types such as core suppliers, key family members, heavy evening callers, or one-off numbers.

    • Productivity and potential misuse analysis
      Estimate how much of the business line’s working-hours call time goes to clearly business categories (Supplier, Important Contacts, Service Provider) versus Family and Unknown, and explore whether some Unknown clusters look more like social/personal usage than business usage.

    • Time-evolution and yearly trends
      Use the year, month, and date_stamp columns to compare behaviour across 2022–2024, such as changes in business vs non-business usage, weekend patterns, or the size of the Unknown group over time.

    • Labelling pipeline and active learning
      Treat Unknown as a pool for incremental labelling: design rules or models to suggest labels, identify the most “informative” Unknown calls for manual review, and track how the proportion of Unknown decreases as the dataset becomes better labelled.
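
The contact-level and working-hours ideas in the list above can be sketched in a few lines of plain Python. This is only an illustration: the rows are made up, every field name except dialled_phone_number is an assumption rather than the dataset's confirmed schema, and "working hours" is assumed to mean Monday-Friday from 08:00 up to 17:00.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative rows only. Apart from dialled_phone_number, the field names
# (category, start, duration_s) are hypothetical and may not match the real columns.
calls = [
    {"dialled_phone_number": 101, "category": "Supplier", "start": "2023-03-06 10:15", "duration_s": 240},
    {"dialled_phone_number": 101, "category": "Supplier", "start": "2023-03-07 19:30", "duration_s": 60},
    {"dialled_phone_number": 202, "category": "Family", "start": "2023-03-11 14:00", "duration_s": 600},
    {"dialled_phone_number": 303, "category": "Unknown", "start": "2023-03-08 09:05", "duration_s": 120},
]


def is_working_hours(ts):
    # Assumed definition: Monday-Friday, 08:00 up to (not including) 17:00.
    return ts.weekday() < 5 and 8 <= ts.hour < 17


def contact_profiles(rows):
    """Aggregate call count, total seconds and working-hours seconds per contact."""
    profiles = defaultdict(lambda: {"calls": 0, "total_s": 0, "working_s": 0})
    for row in rows:
        ts = datetime.strptime(row["start"], "%Y-%m-%d %H:%M")
        profile = profiles[row["dialled_phone_number"]]
        profile["calls"] += 1
        profile["total_s"] += row["duration_s"]
        if is_working_hours(ts):
            profile["working_s"] += row["duration_s"]
    return dict(profiles)


def working_time_by_category(rows):
    """Sum working-hours call seconds per relationship category."""
    totals = defaultdict(int)
    for row in rows:
        ts = datetime.strptime(row["start"], "%Y-%m-%d %H:%M")
        if is_working_hours(ts):
            totals[row["category"]] += row["duration_s"]
    return dict(totals)


profiles = contact_profiles(calls)
by_category = working_time_by_category(calls)
```

Per-contact profiles of this kind could serve as input features for the classification or clustering projects listed above, while the per-category totals give a rough first view of how working-hours call time splits between business and non-business categories.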

    All phone numbers are anonymised to numeric identifiers (without preserving the original South African dialing structure), and names are pseudonyms that describe the relationship or role rather than revealing real identities. Compliance with South Africa’s POPIA (Protection of Personal Information Act) and European GDPR principles has been carefully considered: data collection occurred in South Africa, and the subsequent study and dataset preparation took place in Norway, with personal identifiers removed and only call metadata retained.
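
The description does not say how the numeric identifiers were produced. As one possible approach, the sketch below assigns each distinct number a random, unique integer that carries none of the original digits or dialing structure. The function name, the identifier width and the example numbers are all hypothetical.

```python
import secrets


def build_pseudonym_map(numbers, id_digits=6):
    """Map each distinct phone number to a random, unique numeric identifier.

    Assumes far fewer distinct numbers than 10 ** id_digits; otherwise the
    search for an unused identifier would not terminate.
    """
    mapping = {}
    used = set()
    for number in numbers:
        if number in mapping:
            continue  # the same number always maps to the same identifier
        while True:
            candidate = secrets.randbelow(10 ** id_digits)
            if candidate not in used:
                break
        used.add(candidate)
        mapping[number] = candidate
    return mapping


# Made-up numbers for illustration only.
raw = ["+27 11 555 0101", "+27 21 555 0199", "+27 11 555 0101"]
mapping = build_pseudonym_map(raw)
pseudonymised = [mapping[n] for n in raw]
```

Because repeated numbers map to the same identifier, per-contact analysis still works on the pseudonymised column. The mapping itself would have to be deleted or stored separately under access control, since anyone holding it can reverse the substitution.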

  6. Anonymised judgments of the Regional and Territorial Audit Chambers (2015)

    • data.europa.eu
    zip
    Updated May 27, 2016
    Cite
    Cour des comptes (2016). Anonymised judgments of the Regional and Territorial Audit Chambers (2015) [Dataset]. https://data.europa.eu/data/datasets/5746f8ca88ee382b03d1b934/embed
    Available download formats: zip (15264803)
    Dataset updated
    May 27, 2016
    Dataset authored and provided by
    Cour des comptes
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    As part of its #DataSession on 27 and 28 May 2016, the financial courts decided to open the full text of their case law in open data. The format chosen is XML (or even HTML directly), which will allow the inclusion of metadata useful to re-users. Initially, the decisions of the regional and territorial chambers will be available in Word format but we are working on their conversion into XML or HTML.

    These datasets concern the judgments of the regional and territorial chambers for 2015.

    Please report any anonymization errors that you may detect, so that the dataset can be corrected.


