The Geospatial and Information Substitution and Anonymization Tool (GISA) incorporates techniques for obfuscating identifiable information in point data or documents while maintaining chosen variables to enable future use and meaningful analysis. This approach promotes collaboration and data sharing while reducing the risk of exposing sensitive information. GISA can be used in a number of ways, including the anonymization of point spatial data; batch replacement or removal of user-specified terms from file names and file content; and assistance with the selection and redaction of images and terms based on recommendations generated using natural language processing. Version 1 of the tool, published here, has updated functionality and enhanced capabilities compared with the beta version published in 2023. Please see the User Documentation for further information on capabilities, as well as a guide to downloading and using the tool. If you would like to provide feedback on the tool, please contact edxsupport@netl.doe.gov.

Disclaimer: This project was funded by the United States Department of Energy, National Energy Technology Laboratory, in part, through a site support contract. Neither the United States Government nor any agency thereof, nor any of their employees, nor the support contractor, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

The Geospatial and Information Substitution and Anonymization Tool (GISA) was developed jointly through the U.S. DOE Office of Fossil Energy and Carbon Management’s EDX4CCS Project, in part, from the Bipartisan Infrastructure Law.
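GISA's own algorithms are documented in its User Documentation; purely as a generic illustration of what anonymizing point spatial data can involve, the sketch below applies donut geomasking, displacing each coordinate by a random distance within a bounded ring while leaving all other variables untouched. This is a minimal sketch of the general technique, not GISA's implementation; the function name and radii are illustrative.

```python
import math
import random

def donut_geomask(lat, lon, r_min_m=100.0, r_max_m=1000.0, rng=random):
    """Displace a point by a random distance between r_min_m and r_max_m
    in a random direction (donut geomasking). Generic sketch only --
    not the algorithm used by GISA."""
    theta = rng.uniform(0.0, 2.0 * math.pi)   # random bearing
    dist = rng.uniform(r_min_m, r_max_m)      # displacement inside the "donut"
    # Approximate metre-to-degree conversion; adequate for small offsets
    dlat = (dist * math.cos(theta)) / 111_320.0
    dlon = (dist * math.sin(theta)) / (111_320.0 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

# Attribute columns attached to the point are left unchanged, so the masked
# data remain usable for analyses that tolerate small location error.
masked_lat, masked_lon = donut_geomask(40.4406, -79.9959)
```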
https://www.archivemarketresearch.com/privacy-policy
The global market for data masking tools is experiencing robust growth, driven by increasing regulatory compliance needs (like GDPR and CCPA), the rising adoption of cloud computing, and the expanding volume of sensitive data requiring protection. The market, currently estimated at $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This growth is fueled by organizations' increasing focus on data security and privacy, particularly within sectors like healthcare, finance, and government. The demand for sophisticated data masking solutions that can effectively anonymize and pseudonymize data while maintaining data utility for testing and development is a significant driver. Furthermore, the shift towards cloud-based data masking solutions, offering scalability and ease of management, is contributing to market expansion.

Several key trends are shaping the market. The integration of advanced technologies such as AI and machine learning into data masking tools is enhancing their effectiveness and automating complex masking processes. The emergence of data masking solutions designed for specific data types, such as personally identifiable information (PII) and financial data, caters to niche requirements. However, challenges such as the complexity of implementing and managing data masking solutions, and concerns about the potential impact on data usability, represent restraints on market growth. The market is segmented by deployment type (cloud, on-premises), organization size (small, medium, large enterprises), and industry vertical (healthcare, finance, etc.). Key players in this space include Oracle, Delphix, BMC Software, Informatica, IBM, and several other specialized vendors offering a range of solutions to meet diverse organizational needs. The competitive landscape is dynamic, with ongoing innovation and consolidation shaping the future of the market.
https://www.datainsightsmarket.com/privacy-policy
The Data Creation Tool market, currently valued at $7.233 billion (2025), is experiencing robust growth, projected to expand at a Compound Annual Growth Rate (CAGR) of 18.2% from 2025 to 2033. This significant expansion is driven by the increasing need for high-quality synthetic data across various sectors, including software development, machine learning, and data analytics. Businesses are increasingly adopting these tools to accelerate development cycles, improve data testing and validation processes, and enhance the training and performance of AI models. The rising demand for data privacy and regulatory compliance further fuels this growth, as synthetic data offers a viable alternative to real-world data while preserving sensitive information. Key players like Informatica, Broadcom (with its EDMS solutions), and Delphix are leveraging their established positions in data management to capture significant market share. Emerging players like Keymakr and Mostly AI are also contributing to innovation with specialized solutions focusing on specific aspects of data creation, such as realistic data generation and streamlined workflows.

The market segmentation, while not explicitly provided, can be logically inferred. We can anticipate segments based on deployment (cloud, on-premise), data type (structured, unstructured), industry vertical (financial services, healthcare, retail), and functionality (data generation, data masking, data anonymization). Competitive dynamics are shaping the market with established players facing pressure from innovative startups. The forecast period of 2025-2033 indicates a substantial market expansion opportunity, influenced by factors like advancements in AI/ML technologies that demand massive datasets, and the growing adoption of Agile and DevOps methodologies in software development, both of which rely heavily on efficient data creation tools. Understanding specific regional breakdowns and further market segmentation is crucial for developing targeted business strategies and accurately assessing investment potential.
According to our latest research, the global healthcare data anonymization services market size reached USD 1.42 billion in 2024, reflecting a robust expansion driven by increasing regulatory demands and heightened focus on patient privacy. The market is projected to grow at a CAGR of 15.8% from 2025 to 2033, with the total market value expected to reach USD 5.44 billion by 2033. This impressive growth trajectory is underpinned by the rising adoption of digital health solutions, stringent data protection laws, and the ongoing digitalization of healthcare records worldwide.
The primary growth factor fueling the healthcare data anonymization services market is the proliferation of electronic health records (EHRs) and the expanding use of big data analytics in healthcare. As healthcare providers and organizations increasingly leverage advanced analytics for improving patient outcomes, there is a corresponding surge in data generation. However, these vast datasets often contain sensitive patient information, making data anonymization essential to ensure compliance with regulations such as HIPAA, GDPR, and other regional privacy laws. The increasing frequency of data breaches and cyberattacks has further highlighted the importance of robust anonymization services, prompting healthcare organizations to prioritize investments in data privacy and security solutions. As a result, demand for both software and service-based anonymization solutions continues to rise, contributing significantly to market growth.
Another key driver for the healthcare data anonymization services market is the growing emphasis on research and clinical trials, which require the sharing and analysis of large volumes of patient data. Pharmaceutical and biotechnology companies, as well as research organizations, are increasingly collaborating across borders, necessitating the anonymization of datasets to protect patient identities and comply with international data protection standards. The adoption of cloud-based healthcare solutions has also facilitated the secure and efficient sharing of anonymized data, supporting advancements in personalized medicine and population health management. As organizations seek to balance innovation with compliance, the demand for advanced anonymization technologies that offer high accuracy and scalability is expected to accelerate further.
Technological advancements in artificial intelligence (AI) and machine learning (ML) are also shaping the future of the healthcare data anonymization services market. These technologies are enabling more sophisticated and automated anonymization processes, reducing the risk of re-identification while maintaining data utility for research and analytics. The integration of AI-driven tools into anonymization workflows is helping organizations streamline operations, minimize human error, and achieve greater compliance with evolving regulatory requirements. Additionally, the increasing availability of customizable and interoperable anonymization solutions is making it easier for healthcare organizations of all sizes to adopt and scale these services, thereby broadening the market’s reach and impact.
From a regional perspective, North America continues to dominate the healthcare data anonymization services market, accounting for the largest share in 2024. This leadership position is attributed to the presence of advanced healthcare infrastructure, widespread adoption of EHRs, and strict regulatory frameworks governing patient data privacy. Europe follows closely, driven by the enforcement of the General Data Protection Regulation (GDPR) and a strong culture of data protection. The Asia Pacific region is witnessing the fastest growth, propelled by increasing healthcare digitalization, government initiatives to modernize healthcare systems, and rising awareness of data privacy among patients and providers. Latin America and the Middle East & Africa are also experiencing steady growth, albeit from a smaller base, as healthcare organizations in these regions begin to prioritize data security and compliance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The PRIEST study used patient data from the early phases of the COVID-19 pandemic. The PRIEST study provided descriptive statistics of UK patients with suspected COVID-19 in an emergency department cohort, analysis of existing triage tools, and derivation and validation of a COVID-19-specific tool for adults with suspected COVID-19. For more details please go to the study website: https://www.sheffield.ac.uk/scharr/research/centres/cure/priest

Files contained in the PRIEST study data repository. Main files include:

- PRIEST.csv: dataset containing 22445 observations and 119 variables. Data include initial presentation and follow-up, one row per participant.
- PRIEST_variables.csv: contains variable names, values and brief descriptions.

Additional files include:

- Follow-up v4.0 (PDF): blank 30-day follow-up data collection tool
- Pandemic Respiratory Infection Form v7 (PDF): blank baseline data collection tool
- PRIEST protocol v11.0_17Aug20 (PDF): study protocol
- PRIEST_SAP_v1.0_19jun20 (PDF): statistical analysis plan

The PRIEST data sharing plan follows a controlled access model as described in Good Practice Principles for Sharing Individual Participant Data from Publicly Funded Clinical Trials. Data sharing requests should be emailed to priest-study@sheffield.ac.uk. Data sharing requests will be considered carefully as to whether sharing is necessary to fulfil the purpose of the request. For approval of a data sharing request, an approved ethical review and study protocol must be provided. The PRIEST study was approved by NRES Committee North West - Haydock. REC reference: 12/NW/0303
https://www.datainsightsmarket.com/privacy-policy
The market for data de-identification tools is experiencing robust growth, driven by increasing regulatory scrutiny around data privacy (like GDPR and CCPA), the rising volume of sensitive data being generated and processed, and a growing awareness of the potential risks associated with data breaches. The market, estimated at $2 billion in 2025, is projected to experience a Compound Annual Growth Rate (CAGR) of 15% between 2025 and 2033, reaching an estimated $7 billion by 2033. This expansion is fueled by the adoption of advanced techniques like differential privacy and homomorphic encryption, allowing organizations to derive insights from data while safeguarding individual privacy. Key trends include the increasing demand for integrated solutions that combine data de-identification with other data security measures, a shift towards cloud-based solutions for enhanced scalability and accessibility, and the growing adoption of AI and machine learning for automating data de-identification processes. However, challenges remain, including the complexity of implementing de-identification techniques, concerns around the accuracy and effectiveness of these tools, and the ongoing evolution of privacy regulations requiring continuous adaptation.

The market is highly competitive, with a range of established players and emerging startups vying for market share. This competitive landscape encompasses both large multinational corporations like IBM and Salesforce, offering comprehensive data management and security platforms, and smaller, more specialized companies such as PrivacyOne and Very Good Security, focusing on specific de-identification techniques and data protection solutions. The diverse range of solutions reflects the nuanced requirements across different industries and data types. The segment breakdown likely includes solutions tailored to healthcare, finance, and other sectors with stringent privacy regulations. Geographic distribution will likely show stronger market penetration in regions with robust data protection regulations and a strong emphasis on digital transformation, such as North America and Europe. Continued innovation in areas such as federated learning and privacy-enhancing technologies will further shape the trajectory of this rapidly evolving market.
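Since the description above singles out differential privacy as an enabling technique, a minimal sketch may make it concrete: the Laplace mechanism below adds calibrated noise to a counting query so that the released value reveals little about any individual record. Function names and parameter choices are illustrative, not any vendor's API.

```python
import math
import random

def laplace_noise(scale, rng=random):
    # Inverse-CDF sampling of a Laplace(0, scale) variate
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon):
    """Release a count with epsilon-differential privacy. A counting query
    changes by at most 1 when one record is added or removed (sensitivity 1),
    so the noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)

# Smaller epsilon -> stronger privacy, noisier answer
noisy = dp_count(1234, epsilon=0.5)
```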
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data published in this record was used in the following study:
Exploring Higher Education students' experience with AI-powered educational tools: The case of an Early Warning System
The study analyses students' experience of an early warning system developed at a fully online university. The study is based on 21 semi-structured interviews that yielded a corpus of 21,761 words, to which a mixed inductive and deductive codification approach was applied after thematic analysis. We focused on 11 themes, 52 subthemes, and 396 coded segments to perform content analysis. Our findings revealed that the students, primarily senior workers with high academic self-efficacy, had little experience with this type of system and low expectations of it. However, a usage experience triggered interest and meaningful reflections on the tool. Moreover, a comparative analysis between disciplines related to Computer Science and Economics showed that the first group had higher confidence in, and expectations of, the system and artificial intelligence overall. These results highlight the relevance of supporting students' further experience with, and understanding of, artificial intelligence systems in education so that they accept them and, above all, participate in iterative development processes of such tools to achieve quality, relevance, and fairness.
The four records attached as part of the dataset include:
1- The General CodeTree with exemplar coding excerpts in Spanish
2- Extract of transcriptions in English
3- Full Report in Spanish as extracted from NVIVO, including the extracted codes for the synthesis (1,2) in blue, and the comments made by the two researchers engaged in the interrater agreement.
4- General Content Analysis (Spreadsheet ODS)
https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/MXM0Q2
In the publication [1] we implemented anonymization and synthetization techniques for a structured data set collected during the HiGHmed Use Case Cardiology study [2]. We employed the data anonymization tool ARX [3] and the data synthetization framework ASyH [4], individually and in combination. We evaluated the utility and shortcomings of the different approaches through statistical analyses and privacy risk assessments. Data utility was assessed by computing two heart failure risk scores (Barcelona Bio-HF [5] and MAGGIC [6]) on the protected data sets. We observed only minimal deviations from the scores obtained on the original data set. Additionally, we performed a re-identification risk analysis and found only minor residual risks for common types of privacy threats. We could demonstrate that anonymization and synthetization methods protect privacy while retaining data utility for heart failure risk assessment. Both approaches, and a combination thereof, introduce only minimal deviations from the original data set across all features. While data synthesis techniques can produce any number of new records, data anonymization techniques offer more formal privacy guarantees. Consequently, data synthesis on anonymized data further enhances privacy protection with little impact on data utility. We hereby share all generated data sets with the scientific community through a use and access agreement.

[1] Johann TI, Otte K, Prasser F, Dieterich C. Anonymize or synthesize? Privacy-preserving methods for heart failure score analytics. Eur Heart J 2024. doi:10.1093/ehjdh/ztae083
[2] Sommer KK, Amr A, Bavendiek, Beierle F, Brunecker P, Dathe H, et al. Structured, harmonized, and interoperable integration of clinical routine data to compute heart failure risk scores. Life (Basel) 2022;12:749.
[3] Prasser F, Eicher J, Spengler H, Bild R, Kuhn KA. Flexible data anonymization using ARX—current status and challenges ahead. Softw Pract Exper 2020;50:1277–1304.
[4] Johann TI, Wilhelmi H. ASyH—anonymous synthesizer for health data, GitHub, 2023. Available at: https://github.com/dieterich-lab/ASyH.
[5] Lupón J, de Antonio M, Vila J, Peñafiel J, Galán A, Zamora E, et al. Development of a novel heart failure risk tool: the Barcelona bio-heart failure risk calculator (BCN Bio-HF calculator). PLoS One 2014;9:e85466.
[6] Pocock SJ, Ariti CA, McMurray JJV, Maggioni A, Køber L, Squire IB, et al. Predicting survival in heart failure: a risk score based on 39 372 patients from 30 studies. Eur Heart J 2013;34:1404–1413.
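ARX itself is a Java tool with its own API; purely to make the privacy model behind such re-identification risk analyses concrete, here is a hedged pandas sketch that computes k, the size of the smallest group of records sharing the same quasi-identifier values (a table is k-anonymous for that k). The column names are hypothetical, not those of the study data set.

```python
import pandas as pd

def smallest_equivalence_class(df, quasi_identifiers):
    """Return k, the size of the smallest group of records that share
    identical values on the quasi-identifiers; larger k means lower
    re-identification risk. Generic sketch, not the ARX API."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical usage with assumed column names:
# df = pd.read_csv("protected_dataset.csv")
# k = smallest_equivalence_class(df, ["age_band", "sex", "zip3"])
```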
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
About
The following datasets were captured at a busy Belgian train station between 9pm and 10pm; they contain all 802.11 management frames that were captured. Both datasets were captured approximately 20 minutes apart.
Both datasets are represented by a pcap and CSV file. The CSV file contains the frame type, timestamps, signal strength, SSID and MAC addresses for every frame. In the pcap file, all generic 802.11 elements were removed for anonymization purposes.
Anonymization
All frames were anonymized by removing identifying information or renaming identifiers. Concretely, the following transformations were applied to both datasets:
In the pcap file, anonymization actions could lead to "corrupted" frames because length tags do not correspond with the actual data. However, the file and its frames are still readable in packet analyzing tools such as Wireshark or Scapy.
The script which was used to anonymize is available in the dataset.
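The authoritative script is the `anonymization.py` shipped with the dataset; the sketch below merely illustrates, with Scapy (one of the tools named above), what such identifier renaming can look like: MAC addresses are mapped to deterministic placeholders and SSID elements are renamed. File names and the renaming scheme are assumptions, not the published script. Note that rewriting an element's `info` without updating its length field is exactly the kind of edit that yields the "corrupted" frames mentioned above.

```python
from scapy.all import rdpcap, wrpcap, Dot11, Dot11Elt

mac_map, ssid_map = {}, {}

def fake_mac(mac):
    # Deterministic renaming into the locally administered 02: range
    if mac not in mac_map:
        n = len(mac_map)
        mac_map[mac] = "02:00:00:%02x:%02x:%02x" % ((n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF)
    return mac_map[mac]

frames = rdpcap("capture.pcap")  # hypothetical input file name
for pkt in frames:
    if not pkt.haslayer(Dot11):
        continue
    for field in ("addr1", "addr2", "addr3", "addr4"):
        mac = getattr(pkt[Dot11], field)
        if mac and mac != "ff:ff:ff:ff:ff:ff":  # leave broadcast as-is
            setattr(pkt[Dot11], field, fake_mac(mac))
    elt = pkt.getlayer(Dot11Elt)
    if elt is not None and elt.ID == 0:  # element ID 0 carries the SSID
        key = bytes(elt.info)
        if key not in ssid_map:
            ssid_map[key] = ("net-%d" % len(ssid_map)).encode()
        elt.info = ssid_map[key]

wrpcap("capture_anon.pcap", frames)
```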
Data
| Metric | Dataset 1 | Dataset 2 |
| --- | --- | --- |
| Frames | 36306 | 60984 |
| Beacon frames | 19693 | 27983 |
| Request frames | 798 | 1580 |
| Response frames | 15815 | 31421 |
| Identified Wi-Fi networks | 54 | 70 |
| Identified MAC addresses | 2092 | 2705 |
| Identified wireless devices | 128 | 186 |
| Capture time | 480 s | 422 s |
Dataset contents
The two datasets are stored in the directories `1/` and `2/`. Each directory contains:

- `anonymization.py`: the script that was used to remove identifiers
- `README.md`: documentation about the datasets
License
Copyright 2022-2023 Benjamin Vermunicht, Beat Signer, Maxim Van de Wynckel, Vrije Universiteit Brussel
Permission is hereby granted, free of charge, to any person obtaining a copy of this dataset and associated documentation files (the “Dataset”), to deal in the Dataset without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Dataset, and to permit persons to whom the Dataset is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions that make use of the Dataset.
THE DATASET IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.
The data collection consists of two sets of anonymised user data containing the location of volunteering activity, socio-demographic factors, and efficacy factors (e.g. time taken to onboard volunteers, speed of deployment, number of deployments), created for the status on day one of each month from March 2020 to March 2021. Data tables will be in CSV.
This FAIRsharing record describes: Integrating Data for Analysis, Anonymization and SHaring (iDASH) is one of the National Centers for Biomedical Computing (NCBC) under the NIH Roadmap for Bioinformatics and Computational Biology. Founded in 2010, the iDASH center is hosted on the campus of the University of California, San Diego and addresses fundamental challenges to research progress and enables global collaborations anywhere and anytime. Driving biological projects motivate, inform, and support tool development in iDASH. iDASH collaborates with other NCBCs and disseminates tools via annual workshops, presentations at major conferences, and scientific publications. iDASH offers a secure cyberinfrastructure and tools to support a privacy-preserving data repository and open source software. iDASH also is active in research and training in its mission area.
https://www.marketresearchforecast.com/privacy-policy
The market for SAP Selective Test Data Management Tools is experiencing robust growth, driven by increasing regulatory compliance needs, the expanding adoption of agile and DevOps methodologies, and the rising demand for faster and more efficient software testing processes. The market size in 2025 is estimated at $1.5 billion, projecting a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033. This growth is fueled by the increasing complexity of SAP systems and the associated challenges in managing test data effectively. Large enterprises are the primary adopters of these tools, representing a significant portion of the market share, followed by medium-sized and small enterprises. The cloud-based deployment model is gaining traction due to its scalability, cost-effectiveness, and ease of access, surpassing on-premises solutions in growth rate. Key players like SAP, Informatica, and Qlik are actively shaping the market through continuous product innovation and strategic partnerships. However, challenges remain, including the high initial investment costs associated with implementing these tools, the need for specialized expertise, and data security concerns.

The geographic distribution reveals North America as a dominant region, followed by Europe and Asia Pacific. Growth in the Asia Pacific region is anticipated to be particularly strong, driven by increasing digitalization and the expanding adoption of SAP solutions across various industries. The competitive landscape is marked by both established vendors and emerging players, leading to increased innovation and a wider array of solutions to meet diverse customer needs.

The market is expected to continue its trajectory of growth, driven by factors such as the increasing adoption of cloud-based solutions, the growing demand for data masking and anonymization techniques, and the rising emphasis on test data quality and compliance. Companies are actively seeking solutions that streamline their testing processes, reduce costs, and minimize risks associated with inadequate test data management.
https://www.verifiedmarketresearch.com/privacy-policy/
The Data De-Identification or Pseudonymity Software Market was valued at USD 431.70 million in 2024 and is projected to reach USD 595.38 million by 2032, growing at a CAGR of 4.10% during the forecast period 2026 to 2032. The market drivers for the Data De-Identification or Pseudonymity Software Market can be influenced by various factors. These may include:

- Increasing data privacy regulations worldwide: Strict data privacy laws such as GDPR and CCPA have enforced hefty fines exceeding €1 billion from 2018 to 2023. Compliance requires the adoption of data de-identification tools to protect personal data and avoid regulatory penalties.
- Growing number of data breaches and cyberattacks: Over 45 million healthcare records were exposed between 2019 and 2023, highlighting risks to sensitive data. Data de-identification is essential to minimize the impact of breaches and protect individuals' privacy in affected sectors.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
The CARMEN-I corpus comprises 2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022. These reports, primarily in Spanish with some Catalan sections, cover COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression. The corpus underwent thorough anonymization, validation, and expert annotation, replacing sensitive data with synthetic equivalents. A subset of the corpus features annotations of medical concepts by specialists, encompassing symptoms, diseases, procedures, medications, species, and humans (including family members). CARMEN-I serves as a valuable resource for training and assessing clinical NLP techniques and language models, aiding tasks like de-identification, concept detection, linguistic modifier extraction, document classification, and more. It also facilitates training researchers in clinical NLP and is a collaborative effort involving Barcelona Supercomputing Center's NLP4BIA team, Hospital Clínic, and Universitat de Barcelona's CLiC group.
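Purely as an illustration of the surrogate-replacement idea described above (the actual CARMEN-I pipeline relied on expert annotation and validation, not simple pattern matching), a minimal sketch might substitute synthetic equivalents for recognizable identifier formats; the patterns and surrogate values below are invented for illustration.

```python
import re

# Illustrative patterns and surrogates only -- not the CARMEN-I pipeline.
PATTERNS = {
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"): "01/01/2020",                 # dates
    re.compile(r"\b\d{8}[A-Z]\b"): "00000000X",                         # DNI-like IDs
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"): "nadie@example.org",    # emails
}

def replace_with_surrogates(text):
    """Swap each matched identifier for a fixed synthetic equivalent."""
    for pattern, surrogate in PATTERNS.items():
        text = pattern.sub(surrogate, text)
    return text

print(replace_with_surrogates("Ingreso el 15/03/2020, contacto: ana.perez@example.com"))
```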
This dataset provides the raw anonymised (quantitative) data from the EDSA demand analysis. This data has been gathered from surveys of those who identify as data scientists and managers of data scientists in different sectors across Europe. The coverage of the data includes the level of current expertise of the individual or team (data scientist and manager respectively) in eight key areas. The dataset also includes the importance of the eight key areas as capabilities of a data scientist. Further, the dataset includes a breakdown of key tools, technologies and training delivery methods required to enhance the skill set of data scientists across Europe. The EDSA dashboard provides an interactive view of this dataset and demonstrates how it is being used within the project. The dataset forms part of the European Data Science Academy (EDSA) project, which received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 643937. This three-year project ran from February 2015 to January 2018.

Important note on privacy: This dataset has been collected and made available in a pseudo-anonymous way, as agreed by participants. This means that while each record represents a person, no sensitive identifiable information, such as name, email or affiliation, is available (we don't even collect it). Pseudo-anonymisation is never foolproof; however, the project's privacy impact assessment concluded that the risk resulting from de-anonymisation of the data is extremely low. Note that data from participants who did not explicitly agree to pseudo-anonymous sharing is not included (the terms changed after the survey had started gathering responses, so early responses came from people who had not seen this clause). If you have any concerns please contact the data publisher via the links below.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://creativecommons.org/publicdomain/zero/1.0/
Email archives are a great source of information about the real-world social networks people are generally most involved in. Although sharing of full email exchanges is almost never a good idea, flow metadata (i.e. who sent a message to whom, and when) can be anonymized quite effectively and still carry a lot of information.
I'm sharing over 10 years of flow metadata from my work and personal email accounts to enable data scientists to experiment with their favourite statistics and social network analysis tools. A getting-started notebook is available here.
For anyone willing to extract similar datasets from their own email accounts, the tool I put together for producing mine is available at https://github.com/emarock/mailfix (currently supports extraction from Gmail accounts, IMAP accounts and Apple Mail on macOS).
This dataset contains two files:

- `work.csv`: email flow metadata from my work account (~146,000 emails, from 2005 to 2018)
- `personal.csv`: email flow metadata from my personal account (~41,000 emails, from 2006 to 2018)

As one should expect from any decade-long archive, the data presents some partial corruptions and anomalies that are, however, time-confined and should be easily identified and filtered out through basic statistical analysis. I will be happy to discuss and provide more information in the comments.
I will also be available to extend the dataset with additional data for training advanced classifiers (e.g. lists of actual humans, mailing lists, VIPs...). Feel free to ask in the comments.
The anonymization function (code here, tests here) is based on djb2 string hashing and on a Mersenne Twister pseudorandom generator, implemented in the string-hash and casual node.js modules. It should be practically irreversible, modulo implementation defects.
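The linked code is the authoritative version; as a rough Python rendering of the same idea (Python's `random.Random` is itself a Mersenne Twister), a djb2-style hash of the address can seed the generator that draws a stable synthetic address. The exact hash variant and the synthetic address format below are assumptions, not the author's node.js implementation.

```python
import random

def djb2(s):
    # djb2-style string hash (the family used by the string-hash module)
    h = 5381
    for c in s:
        h = ((h * 33) + ord(c)) & 0xFFFFFFFF
    return h

def anonymize_address(addr):
    """Deterministically map an email address to a synthetic one: the hash
    seeds a Mersenne Twister, which draws the pseudonym. Sketch only --
    not the node.js function shipped with the dataset."""
    rng = random.Random(djb2(addr.lower()))
    return "user%06d@example.org" % rng.randrange(1_000_000)

# Determinism preserves the who-talked-to-whom structure across the dataset
assert anonymize_address("Alice@Work.com") == anonymize_address("alice@work.com")
```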
However, if you've ever been involved in email exchanges with me, you can work your way back to the anonymized address associated with your actual address by comparing message timestamps. Similarly, with a little more guesswork, you can discover the anonymized addresses of those who were also involved in those exchanges. Since the same holds for them with respect to you, if that is of any concern just reach out and I'll censor the problematic entries in the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These two syntax files were used to convert the SPSS data output from the Qualtrics survey tool into the 17 cleansed and anonymised RAAAP-2 datasets from the 2019 international survey of research managers and administrators. The first creates an interim cleansed and anonymised datafile; the second splits this into separate datasets to ensure anonymisation. Errata (16/6/23): v13 of the main Data Cleansing file had an error (two variables were missing value labels). This file has now been replaced with v14, and the Main Dataset has also been updated with the new data.
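For readers who don't use SPSS, a hedged pandas analogue of the same two-step idea (cleanse to an interim table, then split into separate datasets so that quasi-identifying variable combinations are never released together) might look as follows; all column and file names are hypothetical, not those of the RAAAP-2 instrument.

```python
import pandas as pd

raw = pd.read_csv("survey_export.csv")               # hypothetical export
interim = raw.drop(columns=["ip_address", "email"])  # step 1: cleanse

# Step 2: split into separate files and shuffle each independently,
# so the released datasets cannot be trivially re-linked by row order.
interim[["region", "role"]].sample(frac=1.0).to_csv(
    "demographics.csv", index=False)
interim[["years_experience", "salary_band"]].sample(frac=1.0).to_csv(
    "career.csv", index=False)
```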
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
— Initial capture performed by the operator on an A4 background produced specifically, from the BD-ORTHO, for capture at 1:5,000 and pre-plotted for the operator (captured according to the procedure defined by the contracting authority);
— then successive evolutions according to the processing chain implemented by the ASP (manager) using its own tools, including the ISIS-TELEPAC application;
— anonymisation of the islets (deletion of the islets' nominal data);
— generation of a non-significant numeric identifier per islet to allow linkage with the attribute data;
— geographical selection of plots intersecting the GEOFLA contour of the department.
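A hedged GeoPandas sketch of the last three steps (deleting nominal attributes, assigning a non-significant numeric identifier, and selecting plots intersecting the departmental contour) is given below; layer and column names are invented for illustration and do not reflect the actual ASP/ISIS-TELEPAC schema.

```python
import geopandas as gpd

islets = gpd.read_file("islets.shp")                      # hypothetical layer
dept = gpd.read_file("geofla_department.shp").geometry.unary_union

# Anonymisation: delete nominal (holder-identifying) attributes
islets = islets.drop(columns=["holder_name", "holder_id"], errors="ignore")

# Non-significant numeric identifier per islet, kept only to allow a
# link with the separately released attribute data
islets["islet_num"] = range(1, len(islets) + 1)

# Geographic selection: plots intersecting the GEOFLA contour of the department
islets = islets[islets.intersects(dept)]
islets.to_file("islets_anonymised.shp")
```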