The Geospatial and Information Substitution and Anonymization Tool (GISA) incorporates techniques for obfuscating identifiable information in point data or documents while maintaining chosen variables to enable future use and meaningful analysis. This approach promotes collaboration and data sharing while reducing the risk of exposing sensitive information. GISA can be used in a number of ways, including the anonymization of point spatial data; batch replacement or removal of user-specified terms from file names and file content; and assistance with the selection and redaction of images and terms based on recommendations generated using natural language processing. Version 1 of the tool, published here, has updated functionality and enhanced capabilities compared with the beta version published in 2023. Please see the User Documentation for further information on capabilities, as well as a guide to downloading and using the tool. If you would like to provide feedback on the tool, please contact edxsupport@netl.doe.gov.

Disclaimer: This project was funded by the United States Department of Energy, National Energy Technology Laboratory, in part, through a site support contract. Neither the United States Government nor any agency thereof, nor any of their employees, nor the support contractor, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

The Geospatial and Information Substitution and Anonymization Tool (GISA) was developed jointly through the U.S. DOE Office of Fossil Energy and Carbon Management’s EDX4CCS Project, in part, from the Bipartisan Infrastructure Law.
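GISA's own algorithms are documented in its User Documentation; purely as a generic illustration of what anonymizing point spatial data can involve, the sketch below applies donut geomasking, displacing each coordinate by a random distance within a bounded ring while leaving all other variables untouched. This is a minimal sketch of the general technique, not GISA's implementation; the function name and radii are illustrative.

```python
import math
import random

def donut_geomask(lat, lon, r_min_m=100.0, r_max_m=1000.0, rng=random):
    """Displace a point by a random distance between r_min_m and r_max_m
    in a random direction (donut geomasking). Generic sketch only --
    not the algorithm used by GISA."""
    theta = rng.uniform(0.0, 2.0 * math.pi)   # random bearing
    dist = rng.uniform(r_min_m, r_max_m)      # displacement inside the "donut"
    # Approximate metre-to-degree conversion; adequate for small offsets
    dlat = (dist * math.cos(theta)) / 111_320.0
    dlon = (dist * math.sin(theta)) / (111_320.0 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

# Attribute columns attached to the point are left unchanged, so the masked
# data remain usable for analyses that tolerate small location error.
masked_lat, masked_lon = donut_geomask(40.4406, -79.9959)
```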
https://www.archivemarketresearch.com/privacy-policy
The global market for data masking tools is experiencing robust growth, driven by increasing regulatory compliance needs (like GDPR and CCPA), the rising adoption of cloud computing, and the expanding volume of sensitive data requiring protection. The market, currently estimated at $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This growth is fueled by organizations' increasing focus on data security and privacy, particularly within sectors like healthcare, finance, and government. The demand for sophisticated data masking solutions that can effectively anonymize and pseudonymize data while maintaining data utility for testing and development is a significant driver. Furthermore, the shift towards cloud-based data masking solutions, offering scalability and ease of management, is contributing to market expansion.

Several key trends are shaping the market. The integration of advanced technologies such as AI and machine learning into data masking tools is enhancing their effectiveness and automating complex masking processes. The emergence of data masking solutions designed for specific data types, such as personally identifiable information (PII) and financial data, caters to niche requirements. However, challenges such as the complexity of implementing and managing data masking solutions, and concerns about the potential impact on data usability, represent restraints on market growth. The market is segmented by deployment type (cloud, on-premises), organization size (small, medium, large enterprises), and industry vertical (healthcare, finance, etc.). Key players in this space include Oracle, Delphix, BMC Software, Informatica, IBM, and several other specialized vendors offering a range of solutions to meet diverse organizational needs. The competitive landscape is dynamic, with ongoing innovation and consolidation shaping the future of the market.
https://www.datainsightsmarket.com/privacy-policy
The Data Creation Tool market, currently valued at $7.233 billion (2025), is experiencing robust growth, projected to expand at a Compound Annual Growth Rate (CAGR) of 18.2% from 2025 to 2033. This significant expansion is driven by the increasing need for high-quality synthetic data across various sectors, including software development, machine learning, and data analytics. Businesses are increasingly adopting these tools to accelerate development cycles, improve data testing and validation processes, and enhance the training and performance of AI models. The rising demand for data privacy and regulatory compliance further fuels this growth, as synthetic data offers a viable alternative to real-world data while preserving sensitive information. Key players like Informatica, Broadcom (with its EDMS solutions), and Delphix are leveraging their established positions in data management to capture significant market share. Emerging players like Keymakr and Mostly AI are also contributing to innovation with specialized solutions focusing on specific aspects of data creation, such as realistic data generation and streamlined workflows.

The market segmentation, while not explicitly provided, can be logically inferred. We can anticipate segments based on deployment (cloud, on-premise), data type (structured, unstructured), industry vertical (financial services, healthcare, retail), and functionality (data generation, data masking, data anonymization). Competitive dynamics are shaping the market with established players facing pressure from innovative startups. The forecast period of 2025-2033 indicates a substantial market expansion opportunity, influenced by factors like advancements in AI/ML technologies that demand massive datasets, and the growing adoption of Agile and DevOps methodologies in software development, both of which rely heavily on efficient data creation tools. Understanding specific regional breakdowns and further market segmentation is crucial for developing targeted business strategies and accurately assessing investment potential.
According to our latest research, the global healthcare data anonymization services market size reached USD 1.42 billion in 2024, reflecting a robust expansion driven by increasing regulatory demands and heightened focus on patient privacy. The market is projected to grow at a CAGR of 15.8% from 2025 to 2033, with the total market value expected to reach USD 5.44 billion by 2033. This impressive growth trajectory is underpinned by the rising adoption of digital health solutions, stringent data protection laws, and the ongoing digitalization of healthcare records worldwide.
The primary growth factor fueling the healthcare data anonymization services market is the proliferation of electronic health records (EHRs) and the expanding use of big data analytics in healthcare. As healthcare providers and organizations increasingly leverage advanced analytics for improving patient outcomes, there is a corresponding surge in data generation. However, these vast datasets often contain sensitive patient information, making data anonymization essential to ensure compliance with regulations such as HIPAA, GDPR, and other regional privacy laws. The increasing frequency of data breaches and cyberattacks has further highlighted the importance of robust anonymization services, prompting healthcare organizations to prioritize investments in data privacy and security solutions. As a result, demand for both software and service-based anonymization solutions continues to rise, contributing significantly to market growth.
Another key driver for the healthcare data anonymization services market is the growing emphasis on research and clinical trials, which require the sharing and analysis of large volumes of patient data. Pharmaceutical and biotechnology companies, as well as research organizations, are increasingly collaborating across borders, necessitating the anonymization of datasets to protect patient identities and comply with international data protection standards. The adoption of cloud-based healthcare solutions has also facilitated the secure and efficient sharing of anonymized data, supporting advancements in personalized medicine and population health management. As organizations seek to balance innovation with compliance, the demand for advanced anonymization technologies that offer high accuracy and scalability is expected to accelerate further.
Technological advancements in artificial intelligence (AI) and machine learning (ML) are also shaping the future of the healthcare data anonymization services market. These technologies are enabling more sophisticated and automated anonymization processes, reducing the risk of re-identification while maintaining data utility for research and analytics. The integration of AI-driven tools into anonymization workflows is helping organizations streamline operations, minimize human error, and achieve greater compliance with evolving regulatory requirements. Additionally, the increasing availability of customizable and interoperable anonymization solutions is making it easier for healthcare organizations of all sizes to adopt and scale these services, thereby broadening the market’s reach and impact.
From a regional perspective, North America continues to dominate the healthcare data anonymization services market, accounting for the largest share in 2024. This leadership position is attributed to the presence of advanced healthcare infrastructure, widespread adoption of EHRs, and strict regulatory frameworks governing patient data privacy. Europe follows closely, driven by the enforcement of the General Data Protection Regulation (GDPR) and a strong culture of data protection. The Asia Pacific region is witnessing the fastest growth, propelled by increasing healthcare digitalization, government initiatives to modernize healthcare systems, and rising awareness of data privacy among patients and providers. Latin America and the Middle East & Africa are also experiencing steady growth, albeit from a smaller base, as healthcare organizations in these regions begin to prioritize data security and compliance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The PRIEST study used patient data from the early phases of the COVID-19 pandemic. The PRIEST study provided descriptive statistics of UK patients with suspected COVID-19 in an emergency department cohort, analysis of existing triage tools, and derivation and validation of a COVID-19-specific tool for adults with suspected COVID-19. For more details please go to the study website: https://www.sheffield.ac.uk/scharr/research/centres/cure/priest

Files contained in the PRIEST study data repository. Main files include:

- PRIEST.csv: dataset containing 22445 observations and 119 variables. Data include initial presentation and follow-up, one row per participant.
- PRIEST_variables.csv: contains variable names, values and brief descriptions.

Additional files include:

- Follow-up v4.0 (PDF): blank 30-day follow-up data collection tool
- Pandemic Respiratory Infection Form v7 (PDF): blank baseline data collection tool
- PRIEST protocol v11.0_17Aug20 (PDF): study protocol
- PRIEST_SAP_v1.0_19jun20 (PDF): statistical analysis plan

The PRIEST data sharing plan follows a controlled access model as described in Good Practice Principles for Sharing Individual Participant Data from Publicly Funded Clinical Trials. Data sharing requests should be emailed to priest-study@sheffield.ac.uk. Data sharing requests will be considered carefully as to whether sharing is necessary to fulfil the purpose of the request. For approval of a data sharing request, an approved ethical review and study protocol must be provided. The PRIEST study was approved by NRES Committee North West - Haydock. REC reference: 12/NW/0303
https://www.datainsightsmarket.com/privacy-policy
The market for data de-identification tools is experiencing robust growth, driven by increasing regulatory scrutiny around data privacy (like GDPR and CCPA), the rising volume of sensitive data being generated and processed, and a growing awareness of the potential risks associated with data breaches. The market, estimated at $2 billion in 2025, is projected to experience a Compound Annual Growth Rate (CAGR) of 15% between 2025 and 2033, reaching an estimated $7 billion by 2033. This expansion is fueled by the adoption of advanced techniques like differential privacy and homomorphic encryption, allowing organizations to derive insights from data while safeguarding individual privacy. Key trends include the increasing demand for integrated solutions that combine data de-identification with other data security measures, a shift towards cloud-based solutions for enhanced scalability and accessibility, and the growing adoption of AI and machine learning for automating data de-identification processes. However, challenges remain, including the complexity of implementing de-identification techniques, concerns around the accuracy and effectiveness of these tools, and the ongoing evolution of privacy regulations requiring continuous adaptation.

The market is highly competitive, with a range of established players and emerging startups vying for market share. This competitive landscape encompasses both large multinational corporations like IBM and Salesforce, offering comprehensive data management and security platforms, and smaller, more specialized companies such as PrivacyOne and Very Good Security, focusing on specific de-identification techniques and data protection solutions. The diverse range of solutions reflects the nuanced requirements across different industries and data types. The segment breakdown likely includes solutions tailored to healthcare, finance, and other sectors with stringent privacy regulations. Geographic distribution will likely show stronger market penetration in regions with robust data protection regulations and a strong emphasis on digital transformation, such as North America and Europe. Continued innovation in areas such as federated learning and privacy-enhancing technologies will further shape the trajectory of this rapidly evolving market.
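Since the description above singles out differential privacy as an enabling technique, a minimal sketch may make it concrete: the Laplace mechanism below adds calibrated noise to a counting query so that the released value reveals little about any individual record. Function names and parameter choices are illustrative, not any vendor's API.

```python
import math
import random

def laplace_noise(scale, rng=random):
    # Inverse-CDF sampling of a Laplace(0, scale) variate
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon):
    """Release a count with epsilon-differential privacy. A counting query
    changes by at most 1 when one record is added or removed (sensitivity 1),
    so the noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

# Smaller epsilon -> stronger privacy, noisier answer
noisy = dp_count(1234, epsilon=0.5)
```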
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data published in this record was used in the following study:
Exploring Higher Education students' experience with AI-powered educational tools: The case of an Early Warning System
The study analyses students' experience of an early warning system developed at a fully online university. The study is based on 21 semi-structured interviews that yielded a corpus of 21,761 words, to which a mixed inductive and deductive codification approach was applied after thematic analysis. We focused on 11 themes, 52 subthemes, and 396 coded segments to perform content analysis. Our findings revealed that the students, primarily senior workers with high academic self-efficacy, had little experience with this type of system and low expectations of it. However, a usage experience triggered interest and meaningful reflections on the tool. Moreover, a comparative analysis between disciplines related to Computer Science and Economics showed that the first group had higher confidence in, and expectations of, the system and artificial intelligence overall. These results highlight the relevance of supporting students' further experience with, and understanding of, artificial intelligence systems in education so that they accept them and, above all, participate in iterative development processes of such tools to achieve quality, relevance, and fairness.
The four records attached as part of the dataset include:
1- The General CodeTree with exemplar coding excerpts in Spanish
2- Extract of transcriptions in English
3- Full Report in Spanish as extracted from NVIVO, including the extracted codes for the synthesis (1,2) in blue, and the comments made by the two researchers engaged in the interrater agreement.
4- General Content Analysis (Spreadsheet ODS)
https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/MXM0Q2
In the publication [1] we implemented anonymization and synthetization techniques for a structured data set collected during the HiGHmed Use Case Cardiology study [2]. We employed the data anonymization tool ARX [3] and the data synthetization framework ASyH [4], individually and in combination. We evaluated the utility and shortcomings of the different approaches through statistical analyses and privacy risk assessments. Data utility was assessed by computing two heart failure risk scores (Barcelona Bio-HF [5] and MAGGIC [6]) on the protected data sets. We observed only minimal deviations from the scores obtained on the original data set. Additionally, we performed a re-identification risk analysis and found only minor residual risks for common types of privacy threats. We could demonstrate that anonymization and synthetization methods protect privacy while retaining data utility for heart failure risk assessment. Both approaches, and a combination thereof, introduce only minimal deviations from the original data set across all features. While data synthesis techniques can produce any number of new records, data anonymization techniques offer more formal privacy guarantees. Consequently, data synthesis on anonymized data further enhances privacy protection with little impact on data utility. We hereby share all generated data sets with the scientific community through a use and access agreement.

[1] Johann TI, Otte K, Prasser F, Dieterich C. Anonymize or synthesize? Privacy-preserving methods for heart failure score analytics. Eur Heart J 2024. doi:10.1093/ehjdh/ztae083
[2] Sommer KK, Amr A, Bavendiek, Beierle F, Brunecker P, Dathe H, et al. Structured, harmonized, and interoperable integration of clinical routine data to compute heart failure risk scores. Life (Basel) 2022;12:749.
[3] Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX—current status and challenges ahead. Softw Pract Exper 2020;50:1277–1304.
[4] Johann TI, Wilhelmi H. ASyH—anonymous synthesizer for health data, GitHub, 2023. Available at: https://github.com/dieterich-lab/ASyH.
[5] Lupón J, de Antonio M, Vila J, Peñafiel J, Galán A, Zamora E, et al. Development of a novel heart failure risk tool: the Barcelona bio-heart failure risk calculator (BCN Bio-HF calculator). PLoS One 2014;9:e85466.
[6] Pocock SJ, Ariti CA, McMurray JJV, Maggioni A, Køber L, Squire IB, et al. Predicting survival in heart failure: a risk score based on 39 372 patients from 30 studies. Eur Heart J 2013;34:1404–1413.
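ARX itself is a Java tool with its own API; purely to make the privacy model behind such re-identification risk analyses concrete, here is a hedged pandas sketch that computes k, the size of the smallest group of records sharing the same quasi-identifier values (a table is k-anonymous for that k). The column names are hypothetical, not those of the study data set.

```python
import pandas as pd

def smallest_equivalence_class(df, quasi_identifiers):
    """Return k, the size of the smallest group of records that share
    identical values on the quasi-identifiers; larger k means lower
    re-identification risk. Generic sketch, not the ARX API."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical usage with assumed column names:
# df = pd.read_csv("protected_dataset.csv")
# k = smallest_equivalence_class(df, ["age_band", "sex", "zip3"])
```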
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
About
The following datasets were captured at a busy Belgian train station between 9pm and 10pm; they contain all 802.11 management frames that were captured. Both datasets were captured approximately 20 minutes apart.
Both datasets are represented by a pcap and CSV file. The CSV file contains the frame type, timestamps, signal strength, SSID and MAC addresses for every frame. In the pcap file, all generic 802.11 elements were removed for anonymization purposes.
Anonymization
All frames were anonymized by removing identifying information or renaming identifiers. Concretely, the following transformations were applied to both datasets:
In the pcap file, anonymization actions could lead to "corrupted" frames because length tags do not correspond with the actual data. However, the file and its frames are still readable in packet analyzing tools such as Wireshark or Scapy.
The script which was used to anonymize is available in the dataset.
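The authoritative script is the `anonymization.py` shipped with the dataset; the sketch below merely illustrates, with Scapy (one of the tools named above), what such identifier renaming can look like: MAC addresses are mapped to deterministic placeholders and SSID elements are renamed. File names and the renaming scheme are assumptions, not the published script. Note that rewriting an element's `info` without updating its length field is exactly the kind of edit that yields the "corrupted" frames mentioned above.

```python
from scapy.all import rdpcap, wrpcap, Dot11, Dot11Elt

mac_map, ssid_map = {}, {}

def fake_mac(mac):
    # Deterministic renaming into the locally administered 02: range
    if mac not in mac_map:
        n = len(mac_map)
        mac_map[mac] = "02:00:00:%02x:%02x:%02x" % ((n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF)
    return mac_map[mac]

frames = rdpcap("capture.pcap")  # hypothetical input file name
for pkt in frames:
    if not pkt.haslayer(Dot11):
        continue
    for field in ("addr1", "addr2", "addr3", "addr4"):
        mac = getattr(pkt[Dot11], field)
        if mac and mac != "ff:ff:ff:ff:ff:ff":  # leave broadcast as-is
            setattr(pkt[Dot11], field, fake_mac(mac))
    elt = pkt.getlayer(Dot11Elt)
    if elt is not None and elt.ID == 0:  # element ID 0 carries the SSID
        key = bytes(elt.info)
        if key not in ssid_map:
            ssid_map[key] = ("net-%d" % len(ssid_map)).encode()
        elt.info = ssid_map[key]

wrpcap("capture_anon.pcap", frames)
```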
Data
| Metric | Dataset 1 | Dataset 2 |
| --- | --- | --- |
| Frames | 36306 | 60984 |
| Beacon frames | 19693 | 27983 |
| Request frames | 798 | 1580 |
| Response frames | 15815 | 31421 |
| Identified Wi-Fi networks | 54 | 70 |
| Identified MAC addresses | 2092 | 2705 |
| Identified wireless devices | 128 | 186 |
| Capture time | 480 s | 422 s |
Dataset contents
The two datasets are stored in the directories `1/` and `2/`. Each directory contains:

- `anonymization.py`: the script that was used to remove identifiers
- `README.md`: documentation about the datasets
License
Copyright 2022-2023 Benjamin Vermunicht, Beat Signer, Maxim Van de Wynckel, Vrije Universiteit Brussel
Permission is hereby granted, free of charge, to any person obtaining a copy of this dataset and associated documentation files (the “Dataset”), to deal in the Dataset without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Dataset, and to permit persons to whom the Dataset is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions that make use of the Dataset.
THE DATASET IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.
The data collection consists of two sets of anonymised user data containing the location of volunteering activity, socio-demographic factors, and efficacy factors (e.g. time taken to onboard volunteers, speed of deployment, number of deployments), created for the status on day one of each month from March 2020 to March 2021. Data tables will be in CSV.
This FAIRsharing record describes: Integrating Data for Analysis, Anonymization and SHaring (iDASH) is one of the National Centers for Biomedical Computing (NCBC) under the NIH Roadmap for Bioinformatics and Computational Biology. Founded in 2010, the iDASH center is hosted on the campus of the University of California, San Diego and addresses fundamental challenges to research progress and enables global collaborations anywhere and anytime. Driving biological projects motivate, inform, and support tool development in iDASH. iDASH collaborates with other NCBCs and disseminates tools via annual workshops, presentations at major conferences, and scientific publications. iDASH offers a secure cyberinfrastructure and tools to support a privacy-preserving data repository and open source software. iDASH also is active in research and training in its mission area.
https://www.marketresearchforecast.com/privacy-policy
The market for SAP Selective Test Data Management Tools is experiencing robust growth, driven by increasing regulatory compliance needs, the expanding adoption of agile and DevOps methodologies, and the rising demand for faster and more efficient software testing processes. The market size in 2025 is estimated at $1.5 billion, projecting a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033. This growth is fueled by the increasing complexity of SAP systems and the associated challenges in managing test data effectively. Large enterprises are the primary adopters of these tools, representing a significant portion of the market share, followed by medium-sized and small enterprises. The cloud-based deployment model is gaining traction due to its scalability, cost-effectiveness, and ease of access, surpassing on-premises solutions in growth rate. Key players like SAP, Informatica, and Qlik are actively shaping the market through continuous product innovation and strategic partnerships. However, challenges remain, including the high initial investment costs associated with implementing these tools, the need for specialized expertise, and data security concerns.

The geographic distribution reveals North America as a dominant region, followed by Europe and Asia Pacific. Growth in the Asia Pacific region is anticipated to be particularly strong, driven by increasing digitalization and the expanding adoption of SAP solutions across various industries. The competitive landscape is marked by both established vendors and emerging players, leading to increased innovation and a wider array of solutions to meet diverse customer needs.

The market is expected to continue its trajectory of growth, driven by factors such as the increasing adoption of cloud-based solutions, the growing demand for data masking and anonymization techniques, and the rising emphasis on test data quality and compliance. Companies are actively seeking solutions that streamline their testing processes, reduce costs, and minimize risks associated with inadequate test data management.
https://www.verifiedmarketresearch.com/privacy-policy/
The Data De-Identification or Pseudonymity Software Market was valued at USD 431.70 million in 2024 and is projected to reach USD 595.38 million by 2032, growing at a CAGR of 4.10% during the forecast period 2026 to 2032. The market drivers for the Data De-Identification or Pseudonymity Software Market can be influenced by various factors. These may include:

- Increasing data privacy regulations worldwide: Strict data privacy laws such as GDPR and CCPA have enforced hefty fines exceeding €1 billion from 2018 to 2023. Compliance requires the adoption of data de-identification tools to protect personal data and avoid regulatory penalties.
- Growing number of data breaches and cyberattacks: Over 45 million healthcare records were exposed between 2019 and 2023, highlighting risks to sensitive data. Data de-identification is essential to minimize the impact of breaches and protect individuals' privacy in affected sectors.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
The CARMEN-I corpus comprises 2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022. These reports, primarily in Spanish with some Catalan sections, cover COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression. The corpus underwent thorough anonymization, validation, and expert annotation, replacing sensitive data with synthetic equivalents. A subset of the corpus features annotations of medical concepts by specialists, encompassing symptoms, diseases, procedures, medications, species, and humans (including family members). CARMEN-I serves as a valuable resource for training and assessing clinical NLP techniques and language models, aiding tasks like de-identification, concept detection, linguistic modifier extraction, document classification, and more. It also facilitates training researchers in clinical NLP and is a collaborative effort involving Barcelona Supercomputing Center's NLP4BIA team, Hospital Clínic, and Universitat de Barcelona's CLiC group.
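Purely as an illustration of the surrogate-replacement idea described above (the actual CARMEN-I pipeline relied on expert annotation and validation, not simple pattern matching), a minimal sketch might substitute synthetic equivalents for recognizable identifier formats; the patterns and surrogate values below are invented for illustration.

```python
import re

# Illustrative patterns and surrogates only -- not the CARMEN-I pipeline.
PATTERNS = {
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"): "01/01/2020",                 # dates
    re.compile(r"\b\d{8}[A-Z]\b"): "00000000X",                         # DNI-like IDs
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"): "nadie@example.org",    # emails
}

def replace_with_surrogates(text):
    """Swap each matched identifier for a fixed synthetic equivalent."""
    for pattern, surrogate in PATTERNS.items():
        text = pattern.sub(surrogate, text)
    return text

print(replace_with_surrogates("Ingreso el 15/03/2020, contacto: ana.perez@example.com"))
```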
This dataset provides the raw anonymised (quantitative) data from the EDSA demand analysis. This data has been gathered from surveys of those who identify as data scientists and managers of data scientists in different sectors across Europe. The coverage of the data includes the level of current expertise of the individual or team (data scientist and manager respectively) in eight key areas. The dataset also includes the importance of the eight key areas as capabilities of a data scientist. Further, the dataset includes a breakdown of key tools, technologies and training delivery methods required to enhance the skill set of data scientists across Europe. The EDSA dashboard provides an interactive view of this dataset and demonstrates how it is being used within the project. The dataset forms part of the European Data Science Academy (EDSA) project, which received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 643937. This three-year project ran from February 2015 to January 2018.

Important note on privacy: This dataset has been collected and made available in a pseudo-anonymous way, as agreed by participants. This means that while each record represents a person, no sensitive identifiable information, such as name, email or affiliation, is available (we don't even collect it). Pseudo-anonymisation is never foolproof; however, the project's privacy impact assessment concluded that the risk resulting from de-anonymisation of the data is extremely low. Note that data from participants who did not explicitly agree to pseudo-anonymous sharing is not included (the terms changed after the survey had started gathering responses, so early responses came from people who had not seen this clause). If you have any concerns please contact the data publisher via the links below.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://creativecommons.org/publicdomain/zero/1.0/
Email archives are a great source of information about the real-world social networks people are generally most involved in. Although sharing of full email exchanges is almost never a good idea, flow metadata (i.e. who sent a message to whom, and when) can be anonymized quite effectively and still carry a lot of information.
I'm sharing over 10 years of flow metadata from my work and personal email accounts to enable data scientists to experiment with their favourite statistics and social network analysis tools. A getting-started notebook is available here.
For anyone willing to extract similar datasets from their own email accounts, the tool I put together for producing mine is available at https://github.com/emarock/mailfix (currently supports extraction from Gmail accounts, IMAP accounts and Apple Mail on macOS).
This dataset contains two files:

- `work.csv`: email flow metadata from my work account (~146,000 emails, from 2005 to 2018)
- `personal.csv`: email flow metadata from my personal account (~41,000 emails, from 2006 to 2018)

As one should expect from any decade-long archive, the data presents some partial corruptions and anomalies that are, however, time-confined and should be easily identified and filtered out through basic statistical analysis. I will be happy to discuss and provide more information in the comments.
I will also be available to extend the dataset with additional data for training advanced classifiers (e.g. lists of actual humans, mailing lists, VIPs...). Feel free to ask in the comments.
The anonymization function (code here, tests here) is based on djb2 string hashing and on a Mersenne Twister pseudorandom generator, implemented in the string-hash and casual node.js modules. It should be practically irreversible, modulo implementation defects.
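The linked code is the authoritative version; as a rough Python rendering of the same idea (Python's `random.Random` is itself a Mersenne Twister), a djb2-style hash of the address can seed the generator that draws a stable synthetic address. The exact hash variant and the synthetic address format below are assumptions, not the author's node.js implementation.

```python
import random

def djb2(s):
    # djb2-style string hash (the family used by the string-hash module)
    h = 5381
    for c in s:
        h = ((h * 33) + ord(c)) & 0xFFFFFFFF
    return h

def anonymize_address(addr):
    """Deterministically map an email address to a synthetic one: the hash
    seeds a Mersenne Twister, which draws the pseudonym. Sketch only --
    not the node.js function shipped with the dataset."""
    rng = random.Random(djb2(addr.lower()))
    return "user%06d@example.org" % rng.randrange(1_000_000)

# Determinism preserves the who-talked-to-whom structure across the dataset
assert anonymize_address("Alice@Work.com") == anonymize_address("alice@work.com")
```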
However, if you've ever been involved in email exchanges with me, you can work your way back to the anonymized address associated with your actual address by comparing message timestamps. Similarly, with a little more guesswork, you can discover the anonymized addresses of those who were also involved in those exchanges. Since the same holds for them with respect to you, if that is of any concern just reach out and I'll censor the problematic entries in the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These two syntax files were used to convert the SPSS data output from the Qualtrics survey tool into the 17 cleansed and anonymised RAAAP-2 datasets from the 2019 international survey of research managers and administrators. The first creates an interim cleansed and anonymised datafile; the second splits this into separate datasets to ensure anonymisation. Errata (16/6/23): v13 of the main Data Cleansing file had an error (two variables were missing value labels). This file has now been replaced with v14, and the Main Dataset has also been updated with the new data.
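For readers who don't use SPSS, a hedged pandas analogue of the same two-step idea (cleanse to an interim table, then split into separate datasets so that quasi-identifying variable combinations are never released together) might look as follows; all column and file names are hypothetical, not those of the RAAAP-2 instrument.

```python
import pandas as pd

raw = pd.read_csv("survey_export.csv")               # hypothetical export
interim = raw.drop(columns=["ip_address", "email"])  # step 1: cleanse

# Step 2: split into separate files and shuffle each independently,
# so the released datasets cannot be trivially re-linked by row order.
interim[["region", "role"]].sample(frac=1.0).to_csv(
    "demographics.csv", index=False)
interim[["years_experience", "salary_band"]].sample(frac=1.0).to_csv(
    "career.csv", index=False)
```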
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
— Initial capture performed by the operator on an A4 background produced specifically, from the BD-ORTHO, for capture at 1:5,000 and pre-plotted for the operator (captured according to the procedure defined by the contracting authority);
— then successive evolutions according to the processing chain implemented by the ASP (manager) using its own tools, including the ISIS-TELEPAC application;
— anonymisation of the islets (deletion of the islets' nominal data);
— generation of a non-significant numeric identifier per islet to allow linkage with the attribute data;
— geographical selection of plots intersecting the GEOFLA contour of the department.
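A hedged GeoPandas sketch of the last three steps (deleting nominal attributes, assigning a non-significant numeric identifier, and selecting plots intersecting the departmental contour) is given below; layer and column names are invented for illustration and do not reflect the actual ASP/ISIS-TELEPAC schema.

```python
import geopandas as gpd

islets = gpd.read_file("islets.shp")                      # hypothetical layer
dept = gpd.read_file("geofla_department.shp").geometry.unary_union

# Anonymisation: delete nominal (holder-identifying) attributes
islets = islets.drop(columns=["holder_name", "holder_id"], errors="ignore")

# Non-significant numeric identifier per islet, kept only to allow a
# link with the separately released attribute data
islets["islet_num"] = range(1, len(islets) + 1)

# Geographic selection: plots intersecting the GEOFLA contour of the department
islets = islets[islets.intersects(dept)]
islets.to_file("islets_anonymised.shp")
```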