https://dataintelo.com/privacy-and-policy
As of 2023, the global Data De-Identification or Pseudonymity Software market is valued at approximately USD 1.5 billion and is projected to grow at a robust CAGR of 18% from 2024 to 2032, driven by increasing data privacy concerns and stringent regulatory requirements.
The growth of the Data De-Identification or Pseudonymity Software market is primarily fueled by the exponential increase in data generation across industries. With the advent of IoT, AI, and digital transformation strategies, the volume of data generated has seen an unprecedented spike. Organizations are now more aware of the need to protect sensitive information to comply with global data privacy regulations such as GDPR in Europe and CCPA in California. The need to ensure that personal data is anonymized or de-identified before analysis or sharing has escalated, pushing the demand for these software solutions.
Another significant growth factor is the rising number of cyber-attacks and data breaches. As data becomes more valuable, it also becomes a prime target for cybercriminals. In response, companies are investing heavily in data privacy and security measures, including de-identification and pseudonymity solutions, to mitigate risks associated with data breaches. This trend is more prevalent in sectors dealing with highly sensitive information like healthcare, finance, and government. Ensuring that data remains secure and private while being useful for analytics is a key driver for the adoption of these technologies.
Moreover, the evolution of Big Data analytics and cloud computing is also spurring growth in this market. As organizations move their operations to the cloud and leverage big data for decision-making, the importance of maintaining data privacy while utilizing large datasets for analytics cannot be overstated. Cloud-based de-identification solutions offer scalability, flexibility, and cost-effectiveness, making them increasingly popular among enterprises of all sizes. This shift towards cloud deployments is expected to further boost market growth.
Regionally, North America holds the largest market share due to its advanced technological infrastructure and stringent data protection laws. The presence of major technology companies and a high rate of adoption of advanced solutions in the U.S. and Canada contribute significantly to regional market growth. Europe follows closely, driven by rigorous GDPR compliance requirements. The Asia Pacific region is anticipated to witness the fastest growth, attributed to the increasing digitization and growing awareness about data privacy in countries like India and China.
As organizations increasingly seek to protect their sensitive data, the concept of Data Protection on Demand is gaining traction. This model allows businesses to access data protection services as and when needed, providing flexibility and scalability. By leveraging cloud-based platforms, companies can implement robust data protection measures without the need for significant upfront investments in infrastructure. This approach not only ensures compliance with data privacy regulations but also offers a cost-effective solution for managing data security. As the demand for on-demand services continues to rise, Data Protection on Demand is poised to become a critical component of data management strategies across various industries.
The Data De-Identification or Pseudonymity Software market by component is segmented into software and services. The software segment dominates the market, driven by the increasing need for automated solutions that ensure data privacy. These software solutions come with a variety of tools and features designed to anonymize or pseudonymize data efficiently, making them essential for organizations managing large volumes of sensitive information. The software market is expanding rapidly, with new innovations and improvements constantly being introduced to enhance functionality and user experience.
The services segment, though smaller compared to software, plays a crucial role in the market. Services include consulting, implementation, and maintenance, which are essential for the successful deployment and operation of de-identification software. These services help organizations tailor the software to their specific needs, ensuring compliance with regional and industry-specific data protection regulations.
This de-duped Geospatial Mobility dataset is derived from first-party, consented, mobile app data. This data is de-identified prior to Intuizi processing it, and is the highest level, least aggregated dataset that we are able to provide to our customers.
Intuizi customers use this data for many purposes, primarily to understand - at as granular a level as possible - the mobility patterns of de-identified mobile devices in specific countries.
This is incredibly useful for understanding visitation patterns to specific locations in particular territories or regions.
Some of our customers may, in addition, have their own first-party dataset that they want to compare/contrast to a high-level set of de-identified data, thus enriching their existing dataset. They may want to compare visitation to (their own, or other specific) locations to those owned/operated by competitors; or understand where else the devices that show up in their owned/operated locations also happen to go. Please note: re-identification of an individual is contractually prohibited.
When processed against PoI data, it is used to generate our Visualisation Details Dataset, which is then used to create visualisations within our visualisation platform. It can also be further refined, for use as our Postal Origin or Country of Origin Dataset.
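For illustration only, the sketch below shows how a de-identified signals table might be joined against a PoI list to count visits per location. The column names (encrypted_id, lat, lon, poi_name) and the radius threshold are hypothetical assumptions and do not reflect Intuizi's actual schema or pipeline.

```python
# Hedged sketch: counting de-identified device visits near points of interest (PoI).
# Column names and the radius are illustrative assumptions, not Intuizi's schema.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between coordinate arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def visits_to_poi(signals: pd.DataFrame, poi: pd.DataFrame, radius_km: float = 0.1) -> pd.DataFrame:
    """Count unique de-identified devices seen within radius_km of each PoI."""
    pairs = signals.merge(poi, how="cross")  # fine for small PoI lists; use a spatial index at scale
    pairs["dist_km"] = haversine_km(pairs["lat"], pairs["lon"], pairs["poi_lat"], pairs["poi_lon"])
    near = pairs[pairs["dist_km"] <= radius_km]
    return near.groupby("poi_name")["encrypted_id"].nunique().rename("unique_devices").reset_index()
```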
The Intuizi De-identified Signals Dataset comprises fully-consented mobile device data, de-identified at source by the entity which has legal consent to own/process such data, and on whose behalf we work to create a de-identified dataset of Encrypted ID visitation/mobility data.
https://dataintelo.com/privacy-and-policy
The global Data De-identification & Pseudonymity Software Market is projected to reach USD 3.5 billion by 2032, growing at a CAGR of 15.2% from 2024 to 2032. The rise in data privacy regulations and the increasing need for securing sensitive information are key factors driving this growth.
The accelerating pace of digital transformation across various industries has led to an unprecedented surge in data generation. This voluminous data often contains sensitive information that needs robust protection. The growing awareness regarding data privacy and stringent regulations like GDPR in Europe, CCPA in California, and other data protection laws worldwide are compelling organizations to adopt advanced data de-identification and pseudonymity software. These solutions ensure that sensitive data is anonymized or pseudonymized, thus mitigating the risk of data breaches and ensuring compliance with regulations. Consequently, the adoption of data de-identification and pseudonymity software is rapidly increasing.
Another significant growth factor is the increased focus on data security by industries such as healthcare, finance, and government. In healthcare, the protection of patient data is paramount, making the industry a significant consumer of de-identification software. Similarly, in the finance sector, protecting customer information is crucial to maintain trust and comply with regulatory requirements. Government agencies dealing with citizen data are also increasingly investing in these technologies to prevent unauthorized access and misuse of sensitive information. The demand for data de-identification and pseudonymity software is thus witnessing a steady rise across these critical sectors.
Technological advancements and innovation in data security solutions are further propelling market growth. The integration of artificial intelligence and machine learning into de-identification and pseudonymity software has enhanced their effectiveness and efficiency. These advanced technologies enable more accurate and faster processing of large datasets, thereby offering robust data protection. Additionally, the rise of cloud computing and the increasing adoption of cloud-based solutions provide scalable and cost-effective options for organizations, further driving the market.
In this context, the role of Identity Information Protection Service becomes increasingly crucial. As organizations strive to safeguard sensitive data, these services provide an essential layer of security by ensuring that identity-related information is protected from unauthorized access and misuse. Identity Information Protection Service helps organizations comply with data privacy regulations by offering robust solutions that secure personal identifiers, thus reducing the risk of identity theft and data breaches. By integrating these services, companies can enhance their data protection strategies, ensuring that identity information remains confidential and secure across various platforms and applications.
Regionally, North America holds the largest market share, driven by stringent data protection regulations and high adoption rates of advanced technologies. Europe follows, with significant contributions from countries like Germany, the UK, and France, driven by GDPR compliance requirements. The Asia Pacific region is expected to witness the highest growth rate due to the rapid digitalization of economies like China and India, coupled with increasing awareness about data privacy. Latin America and the Middle East & Africa regions are also showing promising growth, albeit from a smaller base.
The Data De-identification & Pseudonymity Software Market by component is segmented into software and services. The software segment includes standalone software solutions designed to de-identify or pseudonymize data. This segment is witnessing substantial growth due to the increasing demand for automated and scalable data protection solutions. The software solutions are enhanced with advanced algorithms and AI capabilities, providing accurate de-identification and pseudonymization of large datasets, which is crucial for organizations dealing with massive amounts of sensitive data.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
About
The following datasets were captured at a busy Belgian train station between 9pm and 10pm; they contain all 802.11 management frames that were captured. Both datasets were captured approximately 20 minutes apart.
Both datasets are represented by a pcap and CSV file. The CSV file contains the frame type, timestamps, signal strength, SSID and MAC addresses for every frame. In the pcap file, all generic 802.11 elements were removed for anonymization purposes.
Anonymization
All frames were anonymized by removing identifying information or renaming identifiers. Concretely, the following transformations were applied to both datasets:
In the pcap file, anonymization actions could lead to "corrupted" frames because length tags do not correspond with the actual data. However, the file and its frames are still readable in packet analyzing tools such as Wireshark or Scapy.
The script which was used to anonymize is available in the dataset.
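For readers who want to understand what such a transformation can look like, the sketch below shows one plausible pseudonymization step (keyed hashing of MAC addresses) and how the anonymized pcap could be read with Scapy. The salt, file paths, and field choices are assumptions; the authoritative transformations are those in the bundled `anonymization.py`.

```python
# Hedged sketch of MAC-address pseudonymization as one plausible anonymization step.
# The actual transformations are defined by the dataset's anonymization.py; the salt
# and paths below are illustrative assumptions.
import hashlib

SALT = b"replace-with-a-random-secret"  # a real release would never publish its salt

def pseudonymize_mac(mac: str) -> str:
    """Map a MAC address to a stable, non-reversible pseudonym in MAC format."""
    digest = hashlib.sha256(SALT + mac.lower().encode()).hexdigest()[:12]
    return ":".join(digest[i:i + 2] for i in range(0, 12, 2))

print(pseudonymize_mac("aa:bb:cc:dd:ee:ff"))

# Reading the anonymized capture with Scapy (requires `pip install scapy`):
# from scapy.all import rdpcap, Dot11
# frames = rdpcap("1/capture.pcap")   # path is an assumption
# beacons = [f for f in frames if f.haslayer(Dot11) and f[Dot11].subtype == 8]
```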
Data
| Metric | Dataset 1 | Dataset 2 |
|---|---|---|
| Frames | 36306 | 60984 |
| Beacon frames | 19693 | 27983 |
| Request frames | 798 | 1580 |
| Response frames | 15815 | 31421 |
| Identified Wi-Fi networks | 54 | 70 |
| Identified MAC addresses | 2092 | 2705 |
| Identified wireless devices | 128 | 186 |
| Capture time | 480 s | 422 s |
Dataset contents
The two datasets are stored in the directories `1/` and `2/`. Each directory contains:
- `anonymization.py`: the script which was used to remove identifiers.
- `README.md`: documentation about the datasets.
License
Copyright 2022-2023 Benjamin Vermunicht, Beat Signer, Maxim Van de Wynckel, Vrije Universiteit Brussel
Permission is hereby granted, free of charge, to any person obtaining a copy of this dataset and associated documentation files (the “Dataset”), to deal in the Dataset without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Dataset, and to permit persons to whom the Dataset is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions that make use of the Dataset.
THE DATASET IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.
According to our latest research, the global medical imaging de-identification software market size reached USD 315 million in 2024, driven by the increasing adoption of digital healthcare solutions and stringent regulatory requirements for patient data privacy. The market is expected to grow at a robust CAGR of 13.2% during the forecast period, reaching approximately USD 858 million by 2033. The primary growth factor fueling this expansion is the rising volume of medical imaging data and the escalating need to ensure compliance with data protection laws such as HIPAA, GDPR, and other regional regulations.
The growth trajectory of the medical imaging de-identification software market is underpinned by the exponential increase in digital imaging procedures across healthcare facilities worldwide. As advanced imaging modalities like MRI, CT, and PET scans become standard in diagnostic workflows, the volume of data generated has surged. This data often contains sensitive patient information, making it imperative for healthcare organizations to adopt robust de-identification solutions. The proliferation of health information exchanges and the increasing emphasis on interoperability have further heightened the need for secure and compliant data sharing. These factors collectively foster a conducive environment for the adoption of de-identification software, as organizations seek to balance data utility with stringent privacy requirements.
Another major driver is the evolving regulatory landscape that mandates strict adherence to patient confidentiality and data protection standards. Regulatory frameworks such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, the General Data Protection Regulation (GDPR) in Europe, and similar regulations in Asia Pacific and other regions are compelling healthcare providers and research institutions to implement advanced de-identification solutions. These regulations impose hefty penalties for non-compliance, further incentivizing investments in software that can automate and streamline the de-identification process. Moreover, the growing trend of collaborative research and data sharing among healthcare entities necessitates reliable de-identification tools to facilitate secure and lawful data exchange.
Technological advancements in artificial intelligence and machine learning are also playing a pivotal role in shaping the medical imaging de-identification software market. Modern solutions leverage AI-driven algorithms to enhance the accuracy and efficiency of de-identification processes, reducing the risk of inadvertent data leaks. These innovations are particularly valuable in large-scale research projects, where massive datasets must be anonymized rapidly and without compromising data integrity. Furthermore, the integration of de-identification software with existing healthcare IT infrastructure, such as PACS and EHR systems, is becoming increasingly seamless, making adoption easier for end-users. This technological evolution is expected to drive further market growth over the next decade.
From a regional perspective, North America currently dominates the medical imaging de-identification software market, accounting for the largest share in 2024. The region’s leadership is attributed to the presence of advanced healthcare infrastructure, high adoption rates of digital health technologies, and stringent regulatory frameworks. Europe follows closely, propelled by GDPR compliance and increasing investments in healthcare IT. The Asia Pacific region is experiencing the fastest growth, fueled by expanding healthcare access, rapid digitalization, and rising awareness of data privacy. Latin America and the Middle East & Africa are also witnessing gradual adoption, supported by ongoing healthcare modernization initiatives and regulatory developments.
The component segment of the medical imaging de-i
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
The CARMEN-I corpus comprises 2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022. These reports, primarily in Spanish with some Catalan sections, cover COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression. The corpus underwent thorough anonymization, validation, and expert annotation, replacing sensitive data with synthetic equivalents. A subset of the corpus features annotations of medical concepts by specialists, encompassing symptoms, diseases, procedures, medications, species, and humans (including family members). CARMEN-I serves as a valuable resource for training and assessing clinical NLP techniques and language models, aiding tasks like de-identification, concept detection, linguistic modifier extraction, document classification, and more. It also facilitates training researchers in clinical NLP and is a collaborative effort involving Barcelona Supercomputing Center's NLP4BIA team, Hospital Clínic, and Universitat de Barcelona's CLiC group.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About the NUDA Dataset

Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.
General
This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.
The dataset consists of text, namely biased sentences with binary bias labels (processed, biased or not biased) as well as metadata about the article. It includes all feedback that was given. The single ratings (unprocessed) used to create the labels with correlating User IDs are included.
For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset does not identify sub-populations, nor can it be considered sensitive to them, and it is not possible to identify individuals.
Description of the Data Files
This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:
NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
Statistics.png: contains all Umami statistics for NewsUnravel's usage data
Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if given
Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
Participant.csv: holds the participant IDs and data processing consent
Collection Process
Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.
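A minimal sketch of how such a majority vote could be reproduced from Feedback.csv; the column names ("contentId", "rating") and the 0/1 encoding are assumptions based on the file description above, not the published schema.

```python
# Hedged sketch: deriving a per-sentence bias label by majority vote from Feedback.csv.
# Column names and the 0/1 rating encoding are assumptions; tie handling may differ
# from the procedure used in the paper.
import pandas as pd

feedback = pd.read_csv("Feedback.csv")

majority = (
    feedback.groupby("contentId")["rating"]
    .agg(lambda r: int(r.mean() >= 0.5))   # ties resolved toward "biased" here
    .rename("majority_label")
    .reset_index()
)
print(majority.head())
```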
Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.
So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face. The dataset will be open source. On acceptance, a link with all details and contact information will be provided. No third parties are involved.
The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.
https://opcrd.co.uk/our-database/data-requests/
About OPCRD
Optimum Patient Care Research Database (OPCRD) is a real-world, longitudinal research database that provides anonymised data to support scientific, medical, public health and exploratory research. OPCRD is established, funded and maintained by Optimum Patient Care Limited (OPC), a not-for-profit social enterprise that has been providing quality improvement programmes and research support services to general practices across the UK since 2005.
Key Features of OPCRD
OPCRD has been purposefully designed to facilitate real-world data collection and address the growing demand for observational and pragmatic medical research, both in the UK and internationally. Data held in OPCRD is representative of routine clinical care and thus enables the study of ‘real-world’ effectiveness and health care utilisation patterns for chronic health conditions.
OPCRD has unique qualities that set it apart from other research data resources:
• De-identified electronic medical records of more than 24.9 million patients
• Coverage of all major UK primary care clinical systems
• Coverage of approximately 35% of the UK population
• One of the biggest primary care research networks in the world, with over 1,175 practices
• Linked patient-reported outcomes for over 68,000 patients, including Covid-19 patient-reported data
• Linkage to secondary care data sources, including Hospital Episode Statistics (HES)
Data Available in OPCRD
OPCRD has received data contributions from over 1,175 practices and currently holds de-identified research ready data for over 24.9 million patients or data subjects. This includes longitudinal primary care patient data and any data relevant to the management of patients in primary care, and thus covers all conditions. The data is derived from both electronic health records (EHR) data and patient reported data from patient questionnaires delivered as part of quality improvement. OPCRD currently holds over 68,000 patient reported questionnaire data on Covid-19, asthma, COPD and rare diseases.
Approvals and Governance
OPCRD has NHS research ethics committee (REC) approval to provide anonymised data for scientific and medical research since 2010, with its most recent approval in 2020 (NHS HRA REC ref: 20/EM/0148). OPCRD is governed by the Anonymised Data Ethics and Protocols Transparency committee (ADEPT). All research conducted using anonymised data from OPCRD must gain prior approval from ADEPT. Proceeds from OPCRD data access fees and detailed feasibility assessments are re-invested into OPC services for the continued free provision of patient quality improvement programmes for contributing practices and patients.
For more information on OPCRD please visit: https://opcrd.co.uk/
https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/MXM0Q2
In the publication [1] we implemented anonymization and synthetization techniques for a structured data set, which was collected during the HiGHmed Use Case Cardiology study [2]. We employed the data anonymization tool ARX [3] and the data synthetization framework ASyH [4] individually and in combination. We evaluated the utility and shortcomings of the different approaches by statistical analyses and privacy risk assessments. Data utility was assessed by computing two heart failure risk scores (Barcelona BioHF [5] and MAGGIC [6]) on the protected data sets. We observed only minimal deviations from the scores computed on the original data set. Additionally, we performed a re-identification risk analysis and found only minor residual risks for common types of privacy threats. We could demonstrate that anonymization and synthetization methods protect privacy while retaining data utility for heart failure risk assessment. Both approaches, and a combination thereof, introduce only minimal deviations from the original data set across all features. While data synthesis techniques can produce any number of new records, data anonymization techniques offer more formal privacy guarantees. Consequently, data synthesis on anonymized data further enhances privacy protection with little impact on data utility. We hereby share all generated data sets with the scientific community through a use and access agreement.

[1] Johann TI, Otte K, Prasser F, Dieterich C. Anonymize or synthesize? Privacy-preserving methods for heart failure score analytics. Eur Heart J 2024. doi:10.1093/ehjdh/ztae083
[2] Sommer KK, Amr A, Bavendiek, Beierle F, Brunecker P, Dathe H, et al. Structured, harmonized, and interoperable integration of clinical routine data to compute heart failure risk scores. Life (Basel) 2022;12:749.
[3] Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX—current status and challenges ahead. Softw Pract Exper 2020;50:1277–1304.
[4] Johann TI, Wilhelmi H. ASyH—anonymous synthesizer for health data, GitHub, 2023. Available at: https://github.com/dieterich-lab/ASyH.
[5] Lupón J, de Antonio M, Vila J, Peñafiel J, Galán A, Zamora E, et al. Development of a novel heart failure risk tool: the Barcelona bio-heart failure risk calculator (BCN Bio-HF calculator). PLoS One 2014;9:e85466.
[6] Pocock SJ, Ariti CA, McMurray JJV, Maggioni A, Køber L, Squire IB, et al. Predicting survival in heart failure: a risk score based on 39 372 patients from 30 studies. Eur Heart J 2013;34:1404–1413.
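A minimal sketch of the kind of utility comparison described above, assuming the risk scores have already been computed and stored as columns; the file and column names are hypothetical, and the actual Barcelona BioHF and MAGGIC calculations are not reproduced here.

```python
# Hedged sketch: comparing the distribution of precomputed risk-score columns between
# the original and a protected (anonymized or synthetic) data set. File and column
# names are assumptions, not those of the shared data sets.
import pandas as pd

original = pd.read_csv("original.csv")
protected = pd.read_csv("anonymized_or_synthetic.csv")

for col in ["barcelona_biohf_score", "maggic_score"]:
    deviation = protected[col].describe() - original[col].describe()
    print(f"{col}: deviation of summary statistics from the original\n{deviation}\n")
```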
Phase 2 of the WholeTraveler study consisted of a global positioning system (GPS) data collection. Phase 2 started immediately after the completion of the Phase 1 survey for any respondent who opted into Phase 2. The raw locational data collected have been processed into identified "trips" and some of those trips into identified "trip chains." Data from Phase 1 and Phase 2 are linked by a unique respondent identifier. Anonymized versions of the Phase 1 and Phase 2 data are both available on Livewire.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes the anonymized qualitative data supporting the study titled "Exploring the implemented guidelines for dyslipidemia treatment and care among nurses and physicians in Jordanian hospitals". The materials include expert interview content and coded thematic summaries used in the qualitative analysis. All data have been de-identified to protect participant confidentiality.
This dataset contains anonymized DICOM images acquired as part of a cardiac T1 mapping study using a 5T MRI system. All personal identifiers have been removed in compliance with DICOM de-identification standards and institutional ethics approval. The dataset includes pre- and post-contrast MOLLI sequences from healthy volunteers and patients. It is made publicly available for academic and non-commercial research purposes.
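As a hedged illustration of DICOM de-identification in general (not the exact pipeline used for this dataset), the sketch below blanks a small subset of identifying attributes with pydicom; the tag list and file names are assumptions, and a compliant release would follow the full DICOM de-identification profile.

```python
# Hedged sketch of basic DICOM de-identification with pydicom (pip install pydicom).
# The tag list is a small illustrative subset, not the full profile applied to this
# dataset; file paths are assumptions.
import pydicom

IDENTIFYING_TAGS = ["PatientName", "PatientID", "PatientBirthDate", "PatientAddress",
                    "ReferringPhysicianName", "InstitutionName"]

def strip_identifiers(src: str, dst: str) -> None:
    ds = pydicom.dcmread(src)
    for tag in IDENTIFYING_TAGS:
        if tag in ds:
            ds.data_element(tag).value = ""   # blank rather than delete, keeps structure
    ds.remove_private_tags()                  # private tags often carry identifiers
    ds.save_as(dst)

strip_identifiers("molli_pre_contrast.dcm", "molli_pre_contrast_deid.dcm")
```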
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiSocial is a dataset (described in a paper) for a multilingual (22 languages) machine-generated text detection benchmark in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual large language models by using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.
If you use this dataset in any publication, project, tool or in any other form, please cite the paper.
Due to the data sources (described below), the dataset may contain harmful, disinformative, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used data sources of older date (lower probability of including machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not have successfully hidden all the sensitive/personal content; thus, use the data cautiously (if feeling affected by such content, report the found issues in this regard to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.
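A minimal sketch of regex-based masking in the spirit of the anonymization described above (emails, @usernames, phone numbers); the patterns and placeholder tokens are illustrative assumptions, and the dataset's actual procedure may differ and, as noted, is imperfect.

```python
# Hedged sketch of regex-based PII masking (emails, @usernames, phone numbers).
# Patterns and placeholder tokens are illustrative, not the dataset's actual procedure.
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),   # run before the @username rule
    (re.compile(r"@\w{2,}"), "[USER]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def mask_pii(text: str) -> str:
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(mask_pii("Contact me at jane.doe@example.org or +421 900 123 456, cc @someuser"))
```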
The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:
Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.
Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).
Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.
Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).
WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.
From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for test split and the remaining up to 1000 for train split if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platform. This process resulted in 61,592 human-written texts, which were further filtered out based on occurrence of some characters or their length, resulting in about 58k human-written texts.
The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).
The dataset has the following fields (a short loading sketch follows this list):
'text' - a text sample,
'label' - 0 for human-written text, 1 for machine-generated text,
'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,
'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,
'language' - the ISO 639-1 language code identifying the detected language of the given text,
'length' - word count of the given text,
'source' - a string identifying the source dataset / platform of the given text,
'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
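A minimal loading sketch for the fields above, assuming the release can be read as a single CSV file named "multisocial.csv"; the actual distribution format and path may differ.

```python
# Hedged sketch: loading the fields listed above and inspecting splits and labels.
# The file name "multisocial.csv" is an assumption about the release format.
import pandas as pd

df = pd.read_csv("multisocial.csv")

train = df[df["split"] == "train"]
test = df[df["split"] == "test"]
print(df.groupby(["language", "label"]).size().unstack(fill_value=0).head())
print(f"train: {len(train)}  test: {len(test)}")
```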
ToDo Statistics (under construction)
This dataset contains information about Bluetooth devices detected by our Bluetooth travel sensors. Each record contains a detected device's anonymized Media Access Control (MAC) address along with the time and location the device was detected. These records alone are not traffic data but can be post-processed to measure the movement of detected devices through the roadway network.

How does the City of Austin use the Bluetooth travel sensor data? The data enables transportation engineers to better understand short- and long-term trends in Austin's traffic patterns, supporting decisions about systems planning and traffic signal timing.

What information does the data contain? The sensor data is available in three datasets:

Individual Address Records (https://data.austintexas.gov/dataset/Bluetooth-Travel-Sensors-Individual-Addresses/qnpj-zrb9/data): Each row represents a Bluetooth device that was detected by one of our sensors. Each record contains the device's anonymized MAC address along with the time and location the device was detected. These records alone are not traffic data but can be post-processed to measure the movement of detected devices through the roadway network.

Individual Traffic Matches (https://data.austintexas.gov/dataset/Bluetooth-Travel-Sensors-Individual-Traffic-Matche/x44q-icha/data): Each row represents one Bluetooth-enabled device that was detected at two locations in the roadway network. Each record contains the device's anonymized MAC address along with information about the origin and destination points at which the device was detected, as well as the time, date, and distance traveled.

Traffic Summary Records (https://data.austintexas.gov/dataset/Bluetooth-Travel-Sensors-Match-Summary-Records/v7zg-5jg9): Aggregate travel time and speed summaries based on the individual traffic match records. Each row summarizes average travel time and speed along a sensor-equipped roadway segment in 15-minute intervals.

Does this data contain personally identifiable information? No. The Media Access Control (MAC) addresses in these datasets are randomly generated.
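A hedged sketch of the post-processing mentioned above: pairing detections of the same anonymized MAC address at two sensor locations to estimate travel times. The column names ("host_timestamp", "reader_identifier", "hashed_mac") and sensor names are assumptions, not the published schema of the Individual Address Records dataset.

```python
# Hedged sketch: estimating travel times by matching the same anonymized MAC address
# at two sensor locations. Column and sensor names are assumptions.
import pandas as pd

records = pd.read_csv("individual_address_records.csv", parse_dates=["host_timestamp"])

origin = records[records["reader_identifier"] == "lamar_and_5th"]
dest = records[records["reader_identifier"] == "lamar_and_12th"]

matches = origin.merge(dest, on="hashed_mac", suffixes=("_origin", "_dest"))
matches["travel_s"] = (matches["host_timestamp_dest"] - matches["host_timestamp_origin"]).dt.total_seconds()
matches = matches[(matches["travel_s"] > 0) & (matches["travel_s"] < 3600)]  # drop reversed/implausible trips
print(matches["travel_s"].describe())
```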
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset consists of a curated and anonymized collection of real job application confirmation emails from a Gmail inbox. It includes confirmation emails, rejection notices, and other relevant correspondences. The dataset was originally curated to address the challenge of eliminating manual job application tracking, allowing for automatic tracking directly from the inbox, capturing application confirmations and rejection notifications.
The dataset has been carefully pre-processed, cleaned, and enriched with derived features such as:
The dataset was originally curated to build a job application tracking agent that can automatically extract and track application updates—such as confirmations, rejections, interview invites, and assessment notifications—directly from the inbox. The goal was to enable users to easily interact with an AI assistant to analyze and manage their job search process more efficiently.
⚠️ Disclaimer: All personally identifiable information (PII), such as names and email addresses, has been fully anonymized or redacted. Any personal names found in the dataset have been replaced with the fictional placeholder "Michael Gary Scott"; this character reference is used purely for fun and does not correspond to any real individual. This dataset is intended strictly for educational and research purposes, and any further use should respect privacy and ethical data handling practices.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
Recent research shows that visualizing linguistic media bias mitigates its negative effects. However, reliable automatic detection methods to generate such visualizations require costly, knowledge-intensive training data. To facilitate data collection for media bias datasets, we present News Ninja, a game employing data-collecting game mechanics to generate a crowdsourced dataset. Before annotating sentences, players are educated on media bias via a tutorial. Our findings show that datasets gathered with crowdsourced workers trained on News Ninja can reach significantly higher inter-annotator agreements than expert and crowdsourced datasets. As News Ninja encourages continuous play, it allows datasets to adapt to the reception and contextualization of news over time, presenting a promising strategy to reduce data collection expenses, educate players, and promote long-term bias mitigation.
General
This dataset was created through player annotations in the News Ninja Game made by ANON. Its goal is to improve the detection of linguistic media bias. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.
The dataset includes sentences with binary bias labels (processed, biased or not biased) as well as the annotations of single players used for the majority vote. It includes all game-collected data. All data is completely anonymous. The dataset does not identify sub-populations, nor can it be considered sensitive to them, and it is not possible to identify individuals.
Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset contains topics such as violence, abortion, and hate against specific races, genders, religions, or sexual orientations.
Description of the Data Files
This repository contains the datasets for the anonymous News Ninja submission. The tables contain the following data:
ExportNewsNinja.csv: Contains 370 BABE sentences and 150 new sentences with their text (sentence), words labeled as biased (words), BABE ground truth (ground_Truth), and the sentence bias label from the player annotations (majority_vote). The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences.
AnalysisNewsNinja.xlsx: Contains 370 BABE sentences and 150 new sentences. The first 370 sentences are re-annotated BABE sentences, and the following 150 sentences are new sentences. The table includes the full sentence (Sentence), the sentence bias label from player annotations (isBiased Game), the new expert label (isBiased Expert), whether the game label and expert label match (Game VS Expert), whether differing labels are false positives or false negatives (false negative, false positive), the ground truth label from BABE (isBiasedBABE), whether Expert and BABE labels match (Expert VS BABE), and whether the game label and BABE label match (Game VS BABE). It also includes the analysis of the agreement between the three rater categories (Game, Expert, BABE); a short agreement-computation sketch follows this file list.
demographics.csv: Contains demographic information of News Ninja players, including gender, age, education, English proficiency, political orientation, news consumption, and consumed outlets.
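A minimal sketch of the agreement computation referenced above, reading AnalysisNewsNinja.xlsx (pandas needs openpyxl for xlsx files) and comparing the three rater categories; the column headers follow the description but may differ in the actual file.

```python
# Hedged sketch: agreement between Game, Expert, and BABE labels from the analysis table.
# Column headers follow the description above and are assumptions about the file.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

analysis = pd.read_excel("AnalysisNewsNinja.xlsx")

pairs = [("isBiased Game", "isBiased Expert"),
         ("isBiased Game", "isBiasedBABE"),
         ("isBiased Expert", "isBiasedBABE")]
for a, b in pairs:
    agree = (analysis[a] == analysis[b]).mean()
    kappa = cohen_kappa_score(analysis[a], analysis[b])
    print(f"{a} vs {b}: raw agreement {agree:.2%}, Cohen's kappa {kappa:.2f}")
```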
Collection Process
Data was collected through interactions with the NewsNinja game. All participants went through a tutorial before annotating 2x10 BABE sentences and 2x10 new sentences. For this first test, players were recruited using Prolific. The game was hosted on a custom-built responsive website. The collection period was from 20.02.2023 to 28.02.2023. Before starting the game, players were informed about the goal and the data processing. After consenting, they could proceed to the tutorial.
The dataset will be open source. A link with all details and contact information will be provided upon acceptance. No third parties are involved.
The dataset will not be maintained as it captures the first test of NewsNinja at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsNinja paper if you use the dataset and contact us if you're interested in more information or joining the project.
http://www.gnu.org/licenses/agpl-3.0.html
The Counter-Trafficking Data Collaborative is the first global data hub on human trafficking, publishing harmonized data from counter-trafficking organizations around the world. Launched in November 2017, the goal of CTDC is to break down information-sharing barriers and equip the counter-trafficking community with up to date, reliable data on human trafficking.
The CTDC global victim of trafficking dataset is the largest of its kind in the world, and currently exists in two forms. The data are based on case management data, gathered from identified cases of human trafficking, disaggregated at the level of the individual. The cases are recorded in a case management system during the provision of protection and assistance services, or are logged when individuals contact a counter-trafficking hotline. The number of observations in the dataset increases as new records are added by the contributing organizations. The global victim of trafficking dataset that is available to download from the website in csv format has been mathematically anonymized, and the complete, non k-anonymized version of the dataset is displayed throughout the website through visualizations and charts showing detailed analysis.
The data come from a variety of sources. The data featured in the global victim of trafficking dataset come from the assistance activities of the contributing organizations, including from case management services and from counter-trafficking hotline logs.
Each dataset has been created through a process of comparing and harmonizing existing data models of contributing partners and data classification systems. Initial areas of compatibility were identified to create a unified system for organizing and mapping data to a single standard. Each contributing organization transforms its data to this shared standard and any identifying information is removed before the datasets are made available.
Counter-trafficking case data contains highly sensitive information, and maintaining privacy and confidentiality is of paramount importance for CTDC. For example, all explicit identifiers, such as names, were removed from the global victim dataset and some data such as age has been transformed into age ranges. No personally identifying information is transferred to or hosted by CTDC, and organizations that want to contribute are asked to anonymize in accordance to the standards set by CTDC.
In addition to the safeguard measures outlined in step 1 the global victim dataset has been anonymized to a higher level, through a mathematical approach called k-anonymization. For a full description of k-anonymization, please refer to the definitions page.
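A minimal sketch of what a k-anonymity check looks like: every combination of quasi-identifier values must occur at least k times in the dataset. The quasi-identifier columns, file name, and value of k below are illustrative assumptions, not CTDC's actual configuration.

```python
# Hedged sketch of a k-anonymity check: each quasi-identifier combination must be
# shared by at least k records. Columns, file name, and k are illustrative assumptions.
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

victims = pd.read_csv("global_victim_dataset.csv")
print(is_k_anonymous(victims, ["ageBroad", "gender", "citizenship"], k=11))
```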
IOM collects and processes data in accordance to its own Data Protection Policy. The other contributors adhere to relevant national and international standards through their policies for collecting and processing personal data.
These data reflect the victims assisted/identified/referred/reported to the contributing organizations, which may not represent all victims identified within a country. Nevertheless, the larger the sample size for a given country (or, the more victims displayed on the map for a given country), the more representative the data are likely to be of the identified victim of trafficking population.
A larger number of identified victims of trafficking does not imply that there is a larger number of undetected victims of trafficking (i.e. a higher prevalence of trafficking).
In addition, samples of identified victims of trafficking cannot be considered random samples of the wider population of victims of trafficking (which includes unidentified victims), since counter-trafficking agencies may be more likely to identify some trafficking cases rather than others. However, with this caveat in mind, the profile of identified victims of trafficking tends to be considered as indicative of the profile of the wider population, given that the availability of other data sources is close to zero.
There are currently no global or regional estimates of the prevalence of human trafficking. National estimates have been conducted in a few countries but they are also based on modelling of existing administrative data from identified cases and should therefore only be considered as basic baseline estimates. Historically, producing estimates of the prevalence of trafficking based on the collection of new primary data through surveys, for example, has been difficult. This is due to trafficking’s complicated legal definition and the challenges of a...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This updated labeled dataset builds upon the initial systematic review by van de Schoot et al. (2018; DOI: 10.1080/00273171.2017.1412293), which included studies on post-traumatic stress symptom (PTSS) trajectories up to 2016, sourced from the Open Science Framework (OSF). As part of the FORAS project - Framework for PTSS trajectORies: Analysis and Synthesis (funded by the Dutch Research Council, grant no. 406.22.GO.048 and pre-registered at PROSPERO under ID CRD42023494027), we extended this dataset to include publications between 2016 and 2023. In total, the search identified 10,594 de-duplicated records obtained via different search methods, each published with their own search query and result:

Exact replication of the initial search: OSF.IO/QABW3
Comprehensive database search: OSF.IO/D3UV5
Snowballing: OSF.IO/M32TS
Full-text search via Dimensions data: OSF.IO/7EXC5
Semantic search via OpenAlex: OSF.IO/M32TS

Humans (BC, RN) and AI (Bron et al., 2024) have screened the records, and disagreements have been resolved (MvZ, BG, RvdS). Each record was screened separately for title, abstract, and full-text inclusion and per inclusion criteria. A detailed screening logbook is available at OSF.IO/B9GD3, and the entire process is described in https://doi.org/10.31234/osf.io/p4xm5. A description of all columns/variables and full methodological details is available in the accompanying codebook.

Important notes:
Duplicates: To maintain consistency and transparency, duplicates are left in the dataset and are labeled with the same classification as the original records. A filter is provided to allow users to exclude these duplicates as needed.
Anonymized data: The dataset "...._anonymous" excludes DOIs, OpenAlex IDs, titles, and abstracts to ensure data anonymization during the review process. The complete dataset, including all identifiers, is uploaded under embargo and will be publicly available on 01-10-2025.

This dataset serves not only as a valuable resource for researchers interested in systematic reviews of PTSS trajectories, facilitating reproducibility and transparency in the research process, but also for data scientists who would like to mimic the screening process using different machine learning and AI models.
Abstract: This dataset was created as part of the TAPIR project to identify internal ORCID coverage at TIB – Leibniz Information Centre for Science and Technology Hannover — and to investigate external ORCID Coverage/Intersection in selected external open data sources (FREYA, ORCID, OpenAlex). Only researchers employed at TIB in June 2021 are considered (93 persons in total). Personal data was anonymized for further processing via an internally resolvable identifier (primary key). The present dataset is the result of mapping and anonymized aggregation of the internally compiled list and externally queried lists from FREYA, ORCID, and OpenAlex generated via the query tool "pidgraph-notebooks" (https://doi.org/10.5281/zenodo.6373245). The context of the data collection and analysis is described in more detail in the related publication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the doctoral dissertation The AI of the Beholder: A Quantitative Study on Human Perception and Appraisal of AI-Generated Images by Joshua Cunningham (Robert Morris University, 2025). The study investigates how individuals perceive and appraise artwork generated by artificial intelligence (AI) in comparison to human-created pieces. Specifically, it examines: (1) whether participants can accurately distinguish AI-generated from human-created artwork, (2) how age and exposure to AI art influence this ability and related appraisals, and (3) how digital versus traditional visual styles of AI art are perceived. The dataset includes anonymized survey responses collected from a diverse group of adult participants. Respondents were asked to evaluate a series of visual artworks—some created by humans, others by AI—across a range of styles, including both digital and traditional aesthetics. Additional demographic information such as age and prior exposure to AI tools was collected to assess moderating effects. The data were analyzed using SPSS to evaluate participant accuracy, preferences, and perceptions. This dataset can support further research into the psychological, aesthetic, and cultural dynamics of AI-generated content, as well as human-machine interaction in the creative arts. You may find the images used in this study, the original survey instrument, as well as a legend detailing each of the variables here: https://drive.google.com/drive/folders/125oaW82HpJUjI7EbaQWk_51puz0DgWKc?usp=sharing