Sensitive Regulated Data: Permitted and Restricted Uses. Purpose; Scope and Authority; Standard; Violation of the Standard - Misuse of Information; Definitions; References; Appendix A: Personally Identifiable Information (PII); Appendix B: Security of Personally Owned Devices that Access or Maintain Sensitive Restricted Data; Appendix C: Sensitive Security Information (SSI)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of three privacy policy corpora (in English and Italian) comprising 81 unique privacy policy texts spanning the period 2018-2021. The first corpus is the original English-language corpus used in the study by Tang et al. [2]. The other two are cross-language corpora built from the first: a source corpus in English and a replication corpus in Italian, the language of a potential replication study.
The policies were collected from:
We manually analyzed the Alexa top 10 Italy websites as of November 2021. Analogously, we analyzed selected apps that, in the same period, had ranked better in the "most profitable games" category of the Play Store for Italy.
All the privacy policies are ANSI-encoded text files and have been manually read and verified.
The dataset is helpful as a starting point for building comparable cross-language privacy policy corpora. The availability of such corpora helps replicate studies in different languages.
Details on the methodology can be found in the accompanying paper.
The available files are as follows:
This dataset is the original dataset used in the publication [1]. The original English U.S. corpus is described in the publication [2].
[1] F. Ciclosi, S. Vidor and F. Massacci. "Building cross-language corpora for human understanding of privacy policies." Workshop on Digital Sovereignty in Cyber Security: New Challenges in Future Vision. Communications in Computer and Information Science. Springer International Publishing, 2023, In press.
[2] J. Tang, H. Shoemaker, A. Lerner, and E. Birrell. Defining Privacy: How Users Interpret Technical Terms in Privacy Policies. Proceedings on Privacy Enhancing Technologies, 3:70–94, 2021.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A dataset of privacy policies in the Greek language, with policies coming from top visited websites in Greece with a privacy policy in the Greek language.
The dataset and the results of its analysis are included.
If you want to use this dataset, please cite the relevant conference publication:
Georgia M. Kapitsaki and Maria Papoutsoglou, "A privacy policies dataset in Greek in the GDPR era," in Proceedings of the 27th Pan-Hellenic Conference on Informatics, PCI 2023.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The data consists of crawled European privacy policies. They were split into paragraphs, and each paragraph was annotated as containing or not containing personal data.
Annotators were asked: "Does this paragraph contain the explicit mention of specific personal data (e.g. name, phone number, social security, …) being collected?".
A full description of the dataset can be found in D3.4 of the SMOOTH project.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Comprehensive dataset outlining Advatec’s privacy policy, including data collection practices, user rights, GDPR compliance, and third-party data handling procedures.
A survey conducted from April to May 2022 found that six in 10 organizations in the United States designated an internal project manager or owner to manage compliance with state-level privacy laws. Around half of the organizations conducted data mapping and had an understanding of data practices across the organization. A further 41 percent said they updated privacy policies, while 40 percent said they were in the process of doing so.
By 2024, the share of the global population to be covered under modern privacy regulations is projected to reach 75 percent. The forecast for the year 2023 was 65 percent. Additionally, in 2020, only ten percent of the global population's privacy was protected by modern laws.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
115 privacy policies from the OPP-115 corpus have been re-annotated with the specific data retention periods disclosed, aligned with the GDPR requirements disclosed in Art. 13 (2)(a). Those retention periods have been categorized into the following 6 distinct cases:
- C0: No data retention period is indicated in the privacy policy/segment.
- C1: A specific data retention period is indicated (e.g., days, weeks, months...).
- C2: It is indicated that the data will be stored indefinitely.
- C3: A criterion is given from which a defined storage period can be inferred (e.g., as long as the user has an active account).
- C4: It is indicated that personal data will be stored for an unspecified period, for fraud prevention, legal or security reasons.
- C5: It is indicated that personal data will be stored for an unspecified period, for purposes other than fraud prevention, legal, or security.

Note: If the privacy policy or segment accounts for more than one case, the case with the highest value was annotated (e.g., if case C2 and case C4 apply, C4 is annotated).
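The tie-breaking rule in the note can be sketched in a few lines of Python. This is only an illustration: the function name `resolve_case` is hypothetical, and it assumes case labels are written as the strings "C0" to "C5".

```python
import re


def resolve_case(cases):
    """Return the highest-valued case label among those that apply.

    `cases` is a non-empty list of labels such as ["C2", "C4"].
    The label format "C<digit>" is an assumption for this sketch.
    """
    if not cases:
        raise ValueError("at least one case label is required")
    for label in cases:
        if not re.fullmatch(r"C[0-5]", label):
            raise ValueError(f"unknown case label: {label!r}")
    # Compare by the numeric part, so C4 outranks C2.
    return max(cases, key=lambda label: int(label[1:]))


print(resolve_case(["C2", "C4"]))  # C4, matching the example in the note
```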
Then, the ground truth dataset served as validation for our proposed ChatGPT-based method, the results of which have also been included in this dataset.
Columns description:
- policy_id: ID of the policy in the OPP-115 dataset
- policy_name: Domain of the privacy policy
- policy_text: Privacy policy collected at the time of OPP-115 dataset creation
- info_type_value: Type of personal data to which data retention refers
- retention_period: Period of retention annotated by OPP-115 annotators
- actual_case: Our annotated case ranging from C0-C5
- GPT_case: ChatGPT classification of the case identified in the segment
- actual_Comply_GDPR: Boolean denoting True if they apparently comply with GDPR (cases C1-C5) or False if not (case C0)
- GPT_Comply_GDPR: Boolean denoting True if they apparently comply with GDPR (cases C1-C5) or False if not (case C0)
- paragraphs_retention_period: List containing the paragraphs annotated as Data Retention by OPP-115 annotators and our red text describing the relevant information used for our annotation decision
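As a rough illustration of how the two case columns relate to the compliance flags, the sketch below derives the Boolean from a case label and measures how often `GPT_case` matches `actual_case`. The sample rows are invented for the example, the helper names are hypothetical, and the assumption that cases are stored as strings like "C3" is not confirmed by the description above.

```python
COMPLIANT_CASES = {"C1", "C2", "C3", "C4", "C5"}


def complies_with_gdpr(case):
    """True for cases C1-C5, False for C0, as described for the *_Comply_GDPR columns."""
    return case in COMPLIANT_CASES


def case_agreement(rows):
    """Fraction of rows where the ChatGPT case equals the manually annotated case."""
    if not rows:
        raise ValueError("no rows to compare")
    matches = sum(1 for row in rows if row["GPT_case"] == row["actual_case"])
    return matches / len(rows)


# Hypothetical rows, not taken from the dataset.
rows = [
    {"actual_case": "C0", "GPT_case": "C0"},
    {"actual_case": "C3", "GPT_case": "C3"},
    {"actual_case": "C4", "GPT_case": "C5"},
    {"actual_case": "C1", "GPT_case": "C1"},
]

print(case_agreement(rows))  # 0.75
print([complies_with_gdpr(row["actual_case"]) for row in rows])  # [False, True, True, True]
```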
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
This dataset supplements publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW’25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites, see concrete numbers below.
The following table lists the number of websites visited per month:
Month | Number of websites |
---|---|
2024-01 | 551'148 |
2024-02 | 792'921 |
2024-03 | 844'537 |
2024-04 | 802'169 |
2024-05 | 805'878 |
2024-06 | 809'518 |
2024-07 | 811'418 |
2024-08 | 813'534 |
2024-09 | 814'321 |
2024-10 | 817'586 |
2024-11 | 828'662 |
2024-12 | 827'101 |
The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), because a website may redirect, resulting in two websites being scraped, or may have to be retried.
To simplify the access, we release the data in large CSVs. Namely, there is one file for policies and another for terms per month. All of these files contain all metadata that are usable for the analysis. If your favourite CSV parser reports the same numbers as above then our dataset is correctly parsed. We use ‘,’ as a separator, the first row is the heading and strings are in quotes.
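A minimal sketch of such a parsing check with Python's standard `csv` module follows. It reads one monthly file row by row, so the file never has to fit in memory. The function name is hypothetical, UTF-8 encoding is an assumption, and the description does not say whether the monthly figures in the table correspond to rows or to distinct `website_month_id` values, so the sketch reports both.

```python
import csv


def summarize_csv(path):
    """Return (number of data rows, number of distinct website_month_id values)."""
    # Policy texts can be far longer than the csv module's default
    # per-field limit (128 KiB), so raise it before reading.
    csv.field_size_limit(2**31 - 1)
    rows = 0
    website_ids = set()
    # newline="" lets the csv module handle line breaks inside quoted strings.
    with open(path, newline="", encoding="utf-8") as handle:
        # DictReader treats the first row as the heading.
        for record in csv.DictReader(handle, delimiter=",", quotechar='"'):
            rows += 1
            website_ids.add(record["website_month_id"])
    return rows, len(website_ids)
```

Call `summarize_csv` with the path of one downloaded monthly file and compare the result with the table above.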
Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication). These might contain personal data, such as addresses of authors of websites that they maintain only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio. Presidio substitutes personal data with tokens. If your personal data has not been effectively anonymized from the database and you wish for it to be deleted, please contact us.
The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. This also means that you likely cannot process all the data in memory at once, so we split the data by month and into separate files for policies and terms.
The files have the following names:
Both files contain the following metadata columns:
- website_month_id: identification of the crawled website
- job_id: one website can have multiple jobs in case of redirects (but most commonly has only one)
- website_index_status: network state of loading the index page. This is resolved by the Chrome DevTools Protocol.
  - DNS_ERROR: domain cannot be resolved
  - OK: all fine
  - REDIRECT: domain redirects somewhere else
  - TIMEOUT: the request timed out
  - BAD_CONTENT_TYPE: 415 Unsupported Media Type
  - HTTP_ERROR: 404 error
  - TCP_ERROR: error in the network connection
  - UNKNOWN_ERROR: unknown error
- website_lang: language of the index page, detected with the langdetect library
- website_url: the URL of the website sampled from the CrUX list (may contain subdomains, etc.). Use this as a unique identifier for connecting data between months.
- job_domain_status: indicates the status of loading the index page. Can be:
  - OK: all works well (at the moment, should be all entries)
  - BLACKLISTED: URL is on our list of blocked URLs
  - UNSAFE: website is not safe according to the Safe Browsing API by Google
  - LOCATION_BLOCKED: country is in the list of blocked countries
- job_started_at: when the visit of the website was started
- job_ended_at: when the visit of the website was ended
- job_crux_popularity: JSON with all popularity ranks of the website this month
- job_index_redirect: when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The index_redirect is then the job.id corresponding to the redirect target.
- job_num_starts: number of crawlers that started this job (counts restarts in case of unsuccessful crawl, max is 3)
- job_from_static: whether this job was included in the static selection (see Sec. 3.3 of the paper)
- job_from_dynamic: whether this job was included in the dynamic selection (see Sec. 3.3 of the paper). This is not exclusive with from_static; both can be true when the lists overlap.
- job_crawl_name: our name of the crawl, contains year and month (e.g., 'regular-2024-12' for regular crawls in Dec 2024)
- policy_url_id: ID of the URL this policy has
- policy_keyword_score: score (higher is better) according to the crawler's keywords list that the given document is a policy
- policy_ml_probability: probability assigned by the BERT model that the given document is a policy
- policy_consideration_basis: on which basis we decided that this URL is a policy. The following three options are executed by the crawler in this order:
- policy_url: full URL to the policy
- policy_content_hash: used as identifier; if the document remained the same between crawls, it won't create a new entry
- policy_content: contains the text of policies and terms extracted to Markdown using Mozilla's readability library
- policy_lang: language of the content, detected by fasttext

The terms columns are analogous to the policy columns; just substitute policy with terms.
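The columns `website_url` and `policy_content_hash` suggest one way to spot policies that changed between two months. The sketch below is a hedged illustration, not the authors' method: the function name and sample rows are hypothetical, and it assumes each monthly file lists a hash for every policy seen for a site in that month, which the description does not confirm.

```python
def changed_policies(previous_rows, current_rows):
    """Return the website_url values whose set of policy hashes differs between two months.

    Each argument is an iterable of dicts with the keys
    "website_url" and "policy_content_hash".
    """
    def hashes_by_site(rows):
        index = {}
        for row in rows:
            index.setdefault(row["website_url"], set()).add(row["policy_content_hash"])
        return index

    before = hashes_by_site(previous_rows)
    after = hashes_by_site(current_rows)
    # Only sites present in both months can be compared.
    shared = before.keys() & after.keys()
    return sorted(url for url in shared if before[url] != after[url])


# Hypothetical rows for two consecutive months.
january = [
    {"website_url": "https://a.example", "policy_content_hash": "h1"},
    {"website_url": "https://b.example", "policy_content_hash": "h2"},
]
february = [
    {"website_url": "https://a.example", "policy_content_hash": "h1"},
    {"website_url": "https://b.example", "policy_content_hash": "h3"},
    {"website_url": "https://c.example", "policy_content_hash": "h4"},
]

print(changed_policies(january, february))  # ['https://b.example']
```

In practice the rows would come from streaming the monthly CSV files rather than from in-memory lists.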
Check this Google Docs for an updated version of this README.md.
https://www.gnu.org/licenses/gpl-3.0-standalone.html
This dataset has been collected and annotated by Terms of Service; Didn't Read (ToS;DR), an independent project aimed at analyzing and summarizing the terms of service and privacy policies of various online services. ToS;DR helps users understand the legal agreements they accept when using online platforms by categorizing and evaluating specific cases related to these policies.
The dataset includes structured information on individual cases, broader topics, specific services, detailed documents, and key points extracted from legal texts.
Cases refer to individual legal cases or specific issues related to the terms of service or privacy policies of a particular online service. Each case typically focuses on a specific aspect of a service's terms, such as data collection, user rights, content ownership, or security practices.
Topics are general categories or themes that encompass various cases. They help organize and group similar cases together based on the type of issues they address. For example, "Data Collection" could be a topic that includes cases related to how a service collects and uses user data.
Services represent specific online platforms, websites, or applications that have their own terms of service and privacy policies.
Points are individual statements or aspects within a case that highlight important information about a service's terms of service or privacy policy. These points can be positive (e.g., strong privacy protections) or negative (e.g., data sharing with third parties).
Documents refer to the original terms of service and privacy policies of the services that are being analyzed on ToS;DR. These documents are the source of information for the cases, points, and ratings provided on the platform. ToS;DR links to the actual documents, so users can review the full details if they choose to.
This Privacy Notice sets out:
This data was collected by the Office of the National Coordinator for Health IT in coordination with Clinovations and the George Washington University Milken Institute of Public Health. ONC and its partners collected the data through research of state government and health information organization websites. The dataset provides policy and law details for four distinct policies or laws, and, where available, hyperlinks to official state records or websites. These four policies or laws are: 1) State Health Information Exchange (HIE) Consent Policies; 2) State-Sponsored HIE Consent Policies; 3) State Laws Requiring Authorization to Disclose Mental Health Information for Treatment, Payment, and Health Care Operations (TPO); and 4) State Laws that Apply a Minimum Necessary Standard to Treatment Disclosures of Mental Health Information.
https://www.archivemarketresearch.com/privacy-policy
The market for Privacy Policy Generator Software is experiencing steady growth, driven by increasing concerns over data privacy and regulatory compliance. The market size stood at $260.3 million in 2025 and is projected to reach $589.9 million by 2033, exhibiting a CAGR of 10.8% from 2025 to 2033. The proliferation of personal data collection, coupled with stringent data protection regulations like GDPR and CCPA, is propelling the adoption of this software among businesses. Key market trends include the rise of cloud-based solutions, catering to the growing need for flexibility and reduced infrastructure costs. Large enterprises are actively leveraging these solutions to manage the complexities of privacy compliance. Additionally, the increasing adoption of privacy policies across verticals such as e-commerce, healthcare, and financial services is further fueling market growth. The major players in the market include Termly.io, iubenda, Get Terms, PrivacyPolicies.com, IBM, Seers Co., Termageddon, LLC, TermsFeed, and AppPrivacy.com. North America remains the dominant region for this market, followed by Europe and Asia-Pacific.
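The quoted growth rate can be checked against the two market sizes with the standard compound annual growth rate formula. The sketch below assumes eight compounding years (2025 to 2033); the function name is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over `years` periods."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the paragraph above, in millions of dollars.
rate = cagr(260.3, 589.9, 2033 - 2025)
print(f"{rate:.1%}")  # 10.8%
```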
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
The Office of the Privacy Commissioner of Canada (OPC) commissioned Phoenix Strategic Perspectives (Phoenix SPI) to conduct quantitative research with Canadian businesses on privacy‐related issues. To address its information needs, the OPC conducts surveys with businesses every two years to inform and guide outreach efforts. The objectives of this research were to collect data on the type of privacy policies and practices businesses have in place; on businesses’ compliance with the law; and on businesses’ awareness and approaches to privacy protection. The findings will be used to help the OPC provide guidance to both individuals and organizations on privacy issues; and enhance its outreach efforts with small businesses, which can be an effective way to achieve positive change for privacy protection. A 13‐minute telephone survey was administered to 1,003 companies across Canada between November 29 and December 19, 2019. The target respondents were senior decision makers with responsibility and knowledge of their company’s privacy and security practices. Businesses were divided by size for sampling purposes: small (1 to 19 employees); medium (20 to 99 employees); and large (100 employees or more). The results were weighted by size, sector and region using Statistics Canada data to ensure that they reflect the actual distribution of businesses in Canada. Based on a sample of this size, the results can be considered accurate to within ±3.1%, 19 times out of 20.
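The stated accuracy is consistent with the usual margin-of-error formula for a proportion at a 95% confidence level ("19 times out of 20"). The sketch below assumes simple random sampling and the worst-case proportion of 0.5; because the survey results were weighted, the report's own calculation may differ in detail.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Half-width of the confidence interval for a proportion.

    z=1.96 corresponds to a 95% confidence level; proportion=0.5 is the
    worst case and gives the widest interval.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


print(f"{margin_of_error(1003):.1%}")  # 3.1%
```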
There has been rising awareness of data privacy among mobile app users in China. As of early 2020, around ** percent of the Chinese respondents in a survey said that they had read privacy terms on mobile apps carefully before agreeing to the conditions. Compared to the same survey conducted in 2018, the reading rate had increased, although it was still more common for consumers to consent without reading the policies.
https://www.datainsightsmarket.com/privacy-policy
The global market for Privacy Policy Generator Software is experiencing robust growth, projected to reach $276 million in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 11.3% from 2025 to 2033. This expansion is fueled by several key drivers. Increasingly stringent data privacy regulations, like GDPR and CCPA, are compelling businesses of all sizes, from large enterprises to small and medium-sized enterprises (SMEs), to adopt robust privacy policies. The rising adoption of cloud-based and web-based solutions further contributes to market growth, providing businesses with accessible and cost-effective tools to manage their compliance needs. The market is segmented by application (Large Enterprises and SMEs) and type (Cloud-Based and Web-Based), with cloud-based solutions gaining significant traction due to their scalability and ease of use. The competitive landscape is dynamic, featuring established players like IBM alongside specialized providers such as Termly.io, iubenda, Get Terms, PrivacyPolicies.com, Seers Co, Termageddon, LLC, and TermsFeed. The geographical distribution shows a strong presence across North America, Europe, and Asia Pacific, reflecting the global reach of data privacy concerns. Future growth will likely be driven by the continuing evolution of data privacy regulations, increasing cyber security threats, and the growing demand for user data transparency and control. The continued digital transformation of businesses worldwide ensures sustained demand for user-friendly and effective privacy policy generation tools. The market is expected to witness further fragmentation as specialized solutions catering to niche industries and specific regulatory requirements emerge. North America is likely to retain its leading market share due to early adoption of privacy regulations and a high concentration of technology companies.
However, regions like Asia Pacific are expected to showcase significant growth potential in the coming years, driven by increasing internet penetration and the implementation of stricter data privacy regulations. The competition within the market is likely to intensify as vendors continue to enhance their product offerings with features such as automated policy updates, multi-lingual support, and seamless integration with other compliance tools.
This statistic shows the results of a survey about the share of users reading privacy policies or terms and conditions for internet sites or apps in Australia as of August 2018. During the survey period, around ** percent of respondents stated that they never read the terms and conditions or privacy policy of a website, compared to **** percent of respondents who claimed to read them every time.
The documents contained in this dataset reflect NASA's comprehensive IT policy in compliance with Federal Government laws and regulations.
https://www.datainsightsmarket.com/privacy-policy
The Security Policy Management (SPM) solutions market is experiencing robust growth, driven by the escalating need for enhanced cybersecurity in a rapidly evolving digital landscape. The increasing complexity of IT infrastructure, coupled with the proliferation of cloud services and remote work, necessitates robust and centralized policy management. Organizations face mounting pressure to comply with stringent data privacy regulations like GDPR and CCPA, further fueling the demand for sophisticated SPM solutions. The market, estimated at $5 billion in 2025, is projected to witness a Compound Annual Growth Rate (CAGR) of 12% through 2033, reaching an estimated $12 billion by then. This growth is propelled by the adoption of advanced technologies like artificial intelligence (AI) and machine learning (ML) to automate policy enforcement and improve threat detection. Leading vendors such as Google, Amazon, Cisco, and Check Point are investing heavily in R&D to offer innovative SPM solutions that address the dynamic needs of enterprises across various sectors. The market segmentation reveals significant opportunities within specific verticals. The financial services sector, for example, is expected to drive significant growth due to strict regulatory compliance demands and the high value of the data they manage. Similarly, the healthcare industry faces increasing pressure to protect sensitive patient information, contributing to strong growth in this segment. However, the market faces certain restraints, including the high initial investment costs associated with implementing SPM solutions and the complexity of integrating these solutions with existing IT infrastructure. Despite these challenges, the long-term outlook for the SPM market remains positive, fueled by continuous innovation and increasing awareness of the importance of effective security policy management.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Article Contributions: This file enumerates and explains the distribution of the datasets in the files.
Privacy Policies dataset: This dataset ["Policies_urls.csv"] contains 142 privacy policy URLs with the corresponding organization. These URLs were obtained with the two methods (Selenium & Google) described in the article. This is the reason for duplicated URLs.
300 Domain Holders: This dataset ["300_domain_holders.xlsx"] contains three different sheets for each of the datasets used for the validations described in the article, i.e., Fortune 500, PII_receivers_1 (for the technique's evaluation) and PII_receivers_2 (for ROI's evaluation).
Recipient Domains: This dataset ["Domains_receiving_PII.csv"] contains the 40,493 dataflows corresponding to the 1,112 unique domains receiving personal data from Android apps.