Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset supplements the publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW’25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites; see concrete numbers below.
The following table lists the number of websites visited per month:
| Month | Number of websites |
|---|---|
| 2024-01 | 551'148 |
| 2024-02 | 792'921 |
| 2024-03 | 844'537 |
| 2024-04 | 802'169 |
| 2024-05 | 805'878 |
| 2024-06 | 809'518 |
| 2024-07 | 811'418 |
| 2024-08 | 813'534 |
| 2024-09 | 814'321 |
| 2024-10 | 817'586 |
| 2024-11 | 828'662 |
| 2024-12 | 827'101 |
The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), because a website may redirect (resulting in two websites being scraped) or may have to be retried.
To simplify access, we release the data in large CSVs: one file for policies and one for terms per month. These files contain all the metadata usable for the analysis. If your favourite CSV parser reports the same numbers as above, then our dataset is parsed correctly. We use ‘,’ as the separator, the first row is the header, and strings are in quotes.
Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication), and these might contain personal data, such as addresses of website authors that are maintained only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio. Presidio substitutes personal data with tokens. If your personal data has not been effectively anonymized from the database and you wish for it to be deleted, please contact us.
The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. This also means that you likely cannot process all the data in memory at once, which is why we split the data by month and into separate files for policies and terms.
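Because of this size, a month's file is best processed in chunks. Below is a minimal sketch in Python with pandas, assuming a hypothetical file name (`policies-2024-01.csv`); the format described above (comma separator, header row, quoted strings) matches pandas defaults.

```python
import pandas as pd

# Hypothetical file name; substitute the real policies file for the month.
PATH = "policies-2024-01.csv"

total_rows = 0
unique_sites = set()

# Default pandas settings match the stated format: ',' separator,
# first row as header, quoted strings.
for chunk in pd.read_csv(PATH, chunksize=100_000):
    total_rows += len(chunk)
    unique_sites.update(chunk["website_url"])

# Compare these counts against the monthly table above as a sanity check.
print(f"rows: {total_rows}, unique websites: {len(unique_sites)}")
```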
The files have the following names:
Both files contain the following metadata columns:
- website_month_id - identification of the crawled website
- job_id - one website can have multiple jobs in case of redirects (but most commonly has only one)
- website_index_status - network state of loading the index page, resolved via the Chrome DevTools Protocol. Possible values:
  - DNS_ERROR - domain cannot be resolved
  - OK - all fine
  - REDIRECT - domain redirects to somewhere else
  - TIMEOUT - the request timed out
  - BAD_CONTENT_TYPE - 415 Unsupported Media Type
  - HTTP_ERROR - 404 error
  - TCP_ERROR - error in the network connection
  - UNKNOWN_ERROR - unknown error
- website_lang - language of the index page, detected with the langdetect library
- website_url - the URL of the website sampled from the CrUX list (may contain subdomains, etc.). Use this as a unique identifier for connecting data between months.
- job_domain_status - indicates the status of loading the index page. Can be:
  - OK - all works well (at the moment, should be all entries)
  - BLACKLISTED - URL is on our list of blocked URLs
  - UNSAFE - website is not safe according to Google's Safe Browsing API
  - LOCATION_BLOCKED - country is in the list of blocked countries
- job_started_at - when the visit of the website started
- job_ended_at - when the visit of the website ended
- job_crux_popularity - JSON with all popularity ranks of the website this month
- job_index_redirect - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The index_redirect is then the job.id corresponding to the redirect target.
- job_num_starts - number of crawlers that started this job (counts restarts in case of an unsuccessful crawl; max is 3)
- job_from_static - whether this job was included in the static selection (see Sec. 3.3 of the paper)
- job_from_dynamic - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper). This is not exclusive with from_static; both can be true when the lists overlap.
- job_crawl_name - our name of the crawl, containing the year and month (e.g., 'regular-2024-12' for the regular crawl in Dec 2024)
- policy_url_id - ID of the URL of this policy
- policy_keyword_score - score (higher is better) according to the crawler's keyword list that the given document is a policy
- policy_ml_probability - probability assigned by the BERT model that the given document is a policy
- policy_consideration_basis - the basis on which we decided that this URL is a policy. The following three options are executed by the crawler in this order:
- policy_url - full URL of the policy
- policy_content_hash - used as an identifier; if the document remained the same between crawls, it won't create a new entry
- policy_content - contains the text of policies and terms extracted to Markdown using Mozilla's readability library
- policy_lang - language of the content, detected by fasttext

The terms columns are analogous to the policy columns; just substitute "policy" with "terms".
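To illustrate how the columns above can be used, here is a small sketch (assuming two monthly policy CSVs with hypothetical file names) that links the same websites across months via website_url and uses policy_content_hash to flag documents that changed between crawls:

```python
import pandas as pd

# Hypothetical file names for two consecutive monthly policy dumps.
jan = pd.read_csv("policies-2024-01.csv",
                  usecols=["website_url", "policy_url", "policy_content_hash"])
feb = pd.read_csv("policies-2024-02.csv",
                  usecols=["website_url", "policy_url", "policy_content_hash"])

# website_url is the stable identifier for connecting data between months.
merged = jan.merge(feb, on=["website_url", "policy_url"],
                   suffixes=("_jan", "_feb"))

# A differing policy_content_hash indicates the policy text changed
# between the two crawls.
changed = merged[merged["policy_content_hash_jan"]
                 != merged["policy_content_hash_feb"]]
print(f"{len(changed)} policies changed between January and February")
```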
Check this Google Docs for an updated version of this README.md.
This statistic shows the results of a survey conducted by Cint on the distribution of websites regularly visited in Sweden in 2017 and 2018. In 2018, ***** percent of respondents stated that they visit Google regularly.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This anonymized data set consists of one month's (October 2018) web tracking data of 2,148 German users. For each user, the data contains the anonymized URL of the webpage the user visited, the domain of the webpage, and the category of the domain (41 distinct categories). In total, these 2,148 users made 9,151,243 URL visits, spanning 49,918 unique domains. For each user in our data set, we have self-reported information (collected via a survey) about their gender and age.
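For orientation, a sketch of how one might reproduce the summary figures above (total URL visits, unique domains, number of users) with pandas; the file name and column names (user_id, domain) are assumptions made for illustration, so check the dataset's own documentation for the actual field names.

```python
import pandas as pd

# Hypothetical file name and column names; consult the dataset's codebook.
visits = pd.read_csv("web_tracking_october_2018.csv")

print("URL visits:", len(visits))                     # reported: 9,151,243
print("unique domains:", visits["domain"].nunique())  # reported: 49,918
print("users:", visits["user_id"].nunique())          # reported: 2,148

# Visits per user, e.g. to relate browsing volume to the survey demographics.
print(visits.groupby("user_id").size().describe())
```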
We acknowledge the support of Respondi AG, which provided the web tracking and survey data free of charge for research purposes, with special thanks to François Erner and Luc Kalaora at Respondi for their insights and help with data extraction.
The data set is analyzed in the following paper:
The code used to analyze the data is also available at https://github.com/gesiscss/web_tracking.
If you use data or code from this repository, please cite the paper above and the Zenodo link.
Users are advised that some domains in this data set may link to potentially questionable or inappropriate content. The domains have not been individually reviewed, as content verification was not the primary objective of this data set. Therefore, user discretion is strongly recommended when accessing or scraping any content from these domains.
This statistic shows the results of a survey conducted by Cint on the distribution of websites regularly visited in Norway in 2017 and 2018. In 2018, **** percent of respondents stated that they visit YouTube regularly.
Unlock the Power of Behavioural Data with GDPR-Compliant Clickstream Insights.
Swash clickstream data offers a comprehensive and GDPR-compliant dataset sourced from users worldwide, encompassing both desktop and mobile browsing behaviour. Here's an in-depth look at what sets us apart and how our data can benefit your organisation.
User-Centric Approach: Unlike traditional data collection methods, we take a user-centric approach by rewarding users for the data they willingly provide. This unique methodology ensures transparent data collection practices, encourages user participation, and establishes trust between data providers and consumers.
Wide Coverage and Varied Categories: Our clickstream data covers diverse categories, including search, shopping, and URL visits. Whether you are interested in understanding user preferences in e-commerce, analysing search behaviour across different industries, or tracking website visits, our data provides a rich and multi-dimensional view of user activities.
GDPR Compliance and Privacy: We prioritise data privacy and strictly adhere to GDPR guidelines. Our data collection methods are fully compliant, ensuring the protection of user identities and personal information. You can confidently leverage our clickstream data without compromising privacy or facing regulatory challenges.
Market Intelligence and Consumer Behaviour: Gain deep insights into market intelligence and consumer behaviour using our clickstream data. Understand trends, preferences, and user behaviour patterns by analysing the comprehensive user-level, time-stamped raw or processed data feed. Uncover valuable information about user journeys, search funnels, and paths to purchase to enhance your marketing strategies and drive business growth.
High-Frequency Updates and Consistency: We provide high-frequency updates and consistent user participation, offering both historical data and ongoing daily delivery. This ensures you have access to up-to-date insights and a continuous data feed for comprehensive analysis. Our reliable and consistent data empowers you to make accurate and timely decisions.
Custom Reporting and Analysis: We understand that every organisation has unique requirements. That's why we offer customisable reporting options, allowing you to tailor the analysis and reporting of clickstream data to your specific needs. Whether you need detailed metrics, visualisations, or in-depth analytics, we provide the flexibility to meet your reporting requirements.
Data Quality and Credibility: We take data quality seriously. Our data sourcing practices are designed to ensure responsible and reliable data collection. We implement rigorous data cleaning, validation, and verification processes, guaranteeing the accuracy and reliability of our clickstream data. You can confidently rely on our data to drive your decision-making processes.
Unlock the Potential of Your Web Traffic with Advanced Data Resolution
In the digital age, understanding and leveraging web traffic data is crucial for businesses aiming to thrive online. Our pioneering solution transforms anonymous website visits into valuable B2B and B2C contact data, offering unprecedented insights into your digital audience. By integrating our unique tag into your website, you unlock the capability to convert 25-50% of your anonymous traffic into actionable contact rows, directly deposited into an S3 bucket for your convenience. This process, known as "Web Traffic Data Resolution," is at the forefront of digital marketing and sales strategies, providing a competitive edge in understanding and engaging with your online visitors.
Comprehensive Web Traffic Data Resolution

Our product stands out by offering a robust solution for "Web Traffic Data Resolution," a process that demystifies the identities behind your website traffic. By deploying a simple tag on your site, our technology goes to work, analyzing visitor behavior and leveraging proprietary data matching techniques to reveal the individuals and businesses behind the clicks. This innovative approach not only enhances your data collection but does so with respect for privacy and compliance standards, ensuring that your business gains insights ethically and responsibly.
Deep Dive into Web Traffic Data

At the core of our solution is the sophisticated analysis of "Web Traffic Data." Our system meticulously collects and processes every interaction on your site, from page views to time spent on each section. This data, once anonymous and perhaps seen as abstract numbers, is transformed into a detailed ledger of potential leads and customer insights. By understanding who visits your site, their interests, and their contact information, your business is equipped to tailor marketing efforts, personalize customer experiences, and streamline sales processes like never before.
Benefits of Our Web Traffic Data Resolution Service

Enhanced Lead Generation: By converting anonymous visitors into identifiable contact data, our service significantly expands your pool of potential leads. This direct enhancement of your lead generation efforts can dramatically increase conversion rates and ROI on marketing campaigns.
Targeted Marketing Campaigns: Armed with detailed B2B and B2C contact data, your marketing team can create highly targeted and personalized campaigns. This precision in marketing not only improves engagement rates but also ensures that your messaging resonates with the intended audience.
Improved Customer Insights: Gaining a deeper understanding of your web traffic enables your business to refine customer personas and tailor offerings to meet market demands. These insights are invaluable for product development, customer service improvement, and strategic planning.
Competitive Advantage: In a digital landscape where understanding your audience can make or break your business, our Web Traffic Data Resolution service provides a significant competitive edge. By accessing detailed contact data that others in your industry may overlook, you position your business as a leader in customer engagement and data-driven strategies.
Seamless Integration and Accessibility: Our solution is designed for ease of use, requiring only the placement of a tag on your website to start gathering data. The contact rows generated are easily accessible in an S3 bucket, ensuring that you can integrate this data with your existing CRM systems and marketing tools without hassle.
How It Works: A Closer Look at the Process

Our Web Traffic Data Resolution process is streamlined and user-friendly, designed to integrate seamlessly with your existing website infrastructure:
Tag Deployment: Implement our unique tag on your website with simple instructions. This tag is lightweight and does not impact your site's loading speed or user experience.
Data Collection and Analysis: As visitors navigate your site, our system collects web traffic data in real-time, analyzing behavior patterns, engagement metrics, and more.
Resolution and Transformation: Using advanced data matching algorithms, we resolve the collected web traffic data into identifiable B2B and B2C contact information.
Data Delivery: The resolved contact data is then securely transferred to an S3 bucket, where it is organized and ready for your access. This process occurs daily, ensuring you have the most up-to-date information at your fingertips.
Integration and Action: With the resolved data now in your possession, your business can take immediate action. From refining marketing strategies to enhancing customer experiences, the possibilities are endless.
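As a rough illustration of the Data Delivery step above, here is a sketch (using boto3) of pulling a day's resolved contact files from the S3 bucket; the bucket name and key prefix are placeholders, since these are provisioned per customer.

```python
import boto3

# Placeholder bucket and prefix; these are provisioned for your account.
BUCKET = "example-resolved-traffic"
PREFIX = "daily/2024-05-01/"

s3 = boto3.client("s3")

# List the day's resolved contact files and download them locally,
# ready to be loaded into a CRM or analytics pipeline.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in resp.get("Contents", []):
    key = obj["Key"]
    s3.download_file(BUCKET, key, key.split("/")[-1])
    print("downloaded", key)
```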
Security and Privacy: Our Commitment

Understanding the sensitivity of web traffic data and contact information, our solution is built with security and privacy at its core. We adhere to strict data protection regulat...
Approximately **** percent of internet users surveyed in Peru said that they had accessed news websites and apps from February 20, 2020 to March 5, 2020. Once the first case of COVID-19 in the country was reported, on March 6, 2020, until March 21, 2020, more than ** percent of the respondents stated that they had visited online news platforms. Meanwhile, the share of interviewees who said they had visited shopping websites and apps decreased *** percentage points, from **** percent before the coronavirus outbreak in Peru to **** percent afterwards.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Citizen respondents rank how they want to interact with and consume government data. Survey responses are broken down along several dimensions, including Region, Education Level, Gender, and Household (HH) Income.
This dataset shows the Alexa Top 100 International Websites and provides metrics on the volume of traffic that these sites were able to handle. The Alexa Top 100 lists the 100 most visited websites in the world and measures various statistical information. I have looked up the headquarters, either through Alexa or a Whois lookup, to get a street address which I was then able to geocode. I was only able to successfully geocode 85 of the top 100 sites throughout the world. Source of data was Alexa.com, source URL: http://www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none. Data was from October 12, 2007. Alexa is updated daily, so to get more up-to-date information visit their site directly; they don't have maps, though.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Measuring and characterizing web page performance is a challenging task. When it comes to the mobile world, the highly varying technology characteristics coupled with the opaque network configuration make it even more difficult. Aiming at reproducibility, we present a large-scale measurement study of web page performance collected in eleven commercial mobile networks spanning four countries. We build a dataset of nearly two million web browsing sessions to shed light on the impact of different web protocols, browsers, and mobile technologies on web performance. We find that the impact of mobile broadband access is sizeable. For example, the median page load time using mobile broadband increases by a third compared to wired access. Mobility clearly stresses the system, with handover causing the most evident performance penalties. Contrariwise, our measurements show that the adoption of HTTP/2 and QUIC has practically negligible impact. Our work highlights the importance of large-scale measurements. Even with our controlled setup, the complexity of the mobile web ecosystem is challenging to untangle. For this, we are releasing the dataset as open data for validation and further research. Together with the dataset, we also release the scripts we used to produce the analysis presented in the paper. Please use the plot_all.sh script to generate the plots in the paper, using the separate scripts from the "scripts" archive.

Should you use any of these resources, please also make an attribution using the following reference (provided here in BibTeX format):

@inproceedings{rajiullah2019web,
  title={{Web Experience in Mobile Networks: Lessons from Two Million Page Visits}},
  author={Rajiullah, Mohammad and Lutu, Andra and Khatouni, Ali Safari and Fida, Mah-Rukh and Mellia, Marco and Brunstrom, Anna and Alay, Ozgu and Alfredsson, Stefan and Mancuso, Vincenzo},
  booktitle={The World Wide Web Conference},
  pages={1532--1543},
  year={2019},
  organization={ACM},
  address={San Francisco, CA, USA},
  keywords={Web Experience, HTTP2, QUIC, TCP, Mobile Broadband, Measurements}
}
OpenWeb Ninja's Google Images Data (Google SERP Data) API provides real-time image search capabilities for images sourced from all public sources on the web.
The API enables you to search and access more than 100 billion images from across the web, with advanced filtering capabilities as supported by Google Advanced Image Search. The API provides Google Images Data (Google SERP Data), including details such as image URL, title, size information, thumbnail, source information, and more data points. The API supports advanced filtering options such as file type, image color, usage rights, creation time, and more. In addition, any Advanced Google Search operators can be used with the API.
OpenWeb Ninja's Google Images Data & Google SERP Data API common use cases:
Creative Media Production: Enhance digital content with a vast array of real-time images, ensuring engaging and brand-aligned visuals for blogs, social media, and advertising.
AI Model Enhancement: Train and refine AI models with diverse, annotated images, improving object recognition and image classification accuracy.
Trend Analysis: Identify emerging market trends and consumer preferences through real-time visual data, enabling proactive business decisions.
Innovative Product Design: Inspire product innovation by exploring current design trends and competitor products, ensuring market-relevant offerings.
Advanced Search Optimization: Improve search engines and applications with enriched image datasets, providing users with accurate, relevant, and visually appealing search results.
OpenWeb Ninja's Annotated Imagery Data & Google SERP Data Stats & Capabilities:
100B+ Images: Access an extensive database of over 100 billion images.
Images Data from all Public Sources (Google SERP Data): Benefit from a comprehensive aggregation of image data from various public websites, ensuring a wide range of sources and perspectives.
Extensive Search and Filtering Capabilities: Utilize advanced search operators and filters to refine image searches by file type, color, usage rights, creation time, and more, making it easy to find exactly what you need.
Rich Data Points: Each image comes with more than 10 data points, including URL, title (annotation), size information, thumbnail, and source information, providing a detailed context for each image.
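Purely as an illustration of the kind of request involved, here is a sketch of querying an image-search endpoint with the filters described above; the endpoint URL, parameter names, and header are hypothetical placeholders, not the provider's actual API reference, so consult the official documentation before integrating.

```python
import requests

# Hypothetical endpoint and parameter names, shown only to illustrate the
# filtering options described above (file type, color, usage rights, time).
resp = requests.get(
    "https://api.example.com/google-image-search",
    params={
        "query": "electric bicycle",
        "file_type": "png",       # filter by file type
        "color": "blue",          # filter by image color
        "usage_rights": "reuse",  # filter by usage rights
        "time": "past_year",      # filter by creation time
    },
    headers={"X-API-Key": "YOUR_KEY"},  # placeholder credential
    timeout=30,
)

for image in resp.json().get("results", []):
    # Typical data points: image URL, title, size, thumbnail, source.
    print(image.get("title"), image.get("url"))
```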
This statistic shows the results of a survey conducted by Cint on the distribution of websites regularly visited in Poland from 2016 to 2018. In 2017, ***** percent of respondents stated that they visit Facebook regularly.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this cross-sectional study, we extracted the uniform resource locator (URL) of each National Abortion Federation member facility on May 6, 2022. We visited each unique URL using webXray (Timothy Libert), which detects third-party tracking. For each web page, we recorded data transfers to third-party domains. Transfers typically include a user’s IP (internet protocol) address and the web page being visited. We also recorded the presence of third-party cookies, data stored on a user’s computer that can facilitate tracking across multiple websites.
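The underlying check is conceptually simple: a request counts as third-party when its registered domain differs from the domain of the page being visited. Below is a minimal sketch of that comparison (not webXray itself, whose implementation is more involved), using the tldextract package and made-up URLs.

```python
import tldextract

def registered_domain(url: str) -> str:
    """Reduce a URL to its registered domain, e.g. sub.example.org -> example.org."""
    ext = tldextract.extract(url)
    return f"{ext.domain}.{ext.suffix}"

def third_party_requests(page_url: str, request_urls: list[str]) -> list[str]:
    """Return the requests whose registered domain differs from the page's."""
    page_domain = registered_domain(page_url)
    return [u for u in request_urls if registered_domain(u) != page_domain]

# Toy example: the analytics and social requests are flagged as third-party
# data transfers, while the site's own asset is not.
print(third_party_requests(
    "https://clinic.example.org/contact",
    ["https://clinic.example.org/style.css",
     "https://www.google-analytics.com/collect",
     "https://connect.facebook.net/sdk.js"],
))
```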
The Internet has become such an important part of our everyday life. It can be used to correspond with people across the world, a lot faster than sending a letter in the mail. The Internet has a wealth of information that is available to anybody just by searching for it. Sometimes you get more information than you ever wanted to know, and sometimes you just can't find the information. This paper only covers a small portion of the websites and their links that have geothermal information concerning reservoir engineering, enhanced geothermal systems and other aspects of geothermal energy. The websites below include sites located in the US, international websites, geothermal associations, and websites where you can access publications. Most of the websites listed below also have links to other websites for even more information.
Open Government Licence: http://reference.data.gov.uk/id/open-government-licence
The Local Directgov web service gives you direct access to the functions that drive the local government services on the Directgov website, so that you can use them in your own websites and other computer applications. They allow you to obtain service data directly from Local Directgov's database, looking up a specific service URL for a local authority, or general contact details if one cannot be found. Alternatively you can use different web service methods to request specific information.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.

For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (please find the code in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist with the websites that were the most popular in 2015. We manually compiled this list using data from DDMM, Alexa and our own research.

The dataset consists of 5 columns:
- the URL
- the type of website: we created a list of website types, and each website has been manually labelled with one category
- Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
- Nieuws-onderwerp: furthermore, each website under the category News was further subdivided by type of news website. For this we created our own list of news categories and manually coded each website
- Bron: for each website we noted which source we used to find it.

The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
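A small sketch of how the whitelist might be loaded and filtered with pandas, assuming it is distributed as a CSV; the file name and the name of the "type of website" column are placeholders, while the remaining column names follow the description above.

```python
import pandas as pd

# Placeholder file name; "type" stands in for the "type of website" column.
whitelist = pd.read_csv("newstracker_whitelist_2015.csv")

# Keep only news websites and break them down by regional focus.
news = whitelist[whitelist["type"] == "News"]
print(news["Nieuws-regio"].value_counts())   # International / National / Local

# Sources used to compile the list (DDMM, Alexa, own research).
print(whitelist["Bron"].value_counts())
```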
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
You have passwords for everything: your devices, your accounts (e.g. banking, social media, and email), and the websites you visit. By using passphrases or strong passwords you can protect your devices and information. Review the tips below to learn how you can create passphrases, strengthen your passwords, and avoid common mistakes that could put your information at risk.
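As a simple illustration of the passphrase approach, here is a sketch that builds a passphrase from randomly chosen words using Python's secrets module; the word list is a tiny stand-in, and a real passphrase should draw from a large dictionary so it has enough entropy.

```python
import secrets

# Tiny stand-in word list; use a large dictionary (thousands of words)
# in practice so the passphrase has enough entropy.
WORDS = ["maple", "harbour", "quartz", "lantern", "otter", "meadow",
         "violin", "glacier", "pepper", "sundial", "copper", "willow"]

def make_passphrase(num_words: int = 4, sep: str = "-") -> str:
    """Pick words with a cryptographically secure random generator."""
    return sep.join(secrets.choice(WORDS) for _ in range(num_words))

print(make_passphrase())  # e.g. "otter-quartz-meadow-lantern"
```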
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General data collected for the study "Analysis of the Quantitative Impact of Social Networks on Web Traffic of Cybermedia in the 27 Countries of the European Union".
Four research questions are posed: what percentage of the total web traffic generated by cybermedia in the European Union comes from social networks? Is said percentage higher or lower than that provided through direct traffic and through the use of search engines via SEO positioning? Which social networks have a greater impact? And is there any degree of relationship between the specific weight of social networks in the web traffic of a cybermedia and circumstances such as the average duration of the user's visit, the number of page views or the bounce rate understood in its formal aspect of not performing any kind of interaction on the visited page beyond reading its content?
To answer these questions, we first selected the cybermedia with the highest web traffic in the 27 countries that are currently part of the European Union after the United Kingdom left on December 31, 2020. In each nation we selected five media using a combination of the global web traffic metrics provided by the tools Alexa (https://www.alexa.com/), which ceased to be operational on May 1, 2022, and SimilarWeb (https://www.similarweb.com/). We have not used local metrics by country, since the results obtained with these first two tools were sufficiently significant and our objective is not to establish a ranking of cybermedia by nation but to examine the relevance of social networks in their web traffic.
In all cases, we selected cybermedia owned by a journalistic company, ruling out those belonging to telecommunications portals or service providers; some correspond to classic news companies (both newspapers and television stations) while others are digital natives, without this circumstance affecting the nature of the proposed research.
We then examined the web traffic data of these cybermedia. The period selected corresponds to the months of October, November and December 2021 and January, February and March 2022. We believe that this six-month stretch smooths out possible one-off variations in a single month, reinforcing the precision of the data obtained.
To obtain this data, we used the SimilarWeb tool, currently the most precise tool available for examining the web traffic of a portal, although it is limited to traffic coming from desktops and laptops and does not take into account traffic from mobile devices, which is currently impossible to determine with existing measurement tools on the market.
It includes:
- Web traffic general data: average visit duration, pages per visit and bounce rate
- Web traffic origin by country
- Percentage of traffic generated from social media over total web traffic
- Distribution of web traffic generated from social networks
- Comparison of web traffic generated from social networks with direct and search procedures
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The Local Directgov web service gives you direct access to the functions that drive the local government services on the Directgov website, so that you can use them in your own websites and other computer applications. They allow you to obtain service data directly from Local Directgov's database, looking up a specific service URL for a local authority, or general contact details if one cannot be found. Alternatively you can use different web service methods to request specific information.
DATAANT provides the ability to extract travel data from public sources like:
- Hotel websites
- Flight aggregators
- Homestay marketplaces
- Experience marketplaces
- Online Travel Agencies (OTAs)
and any open travel industry website you need.
Forecast travel trends with data from Booking.com, Airbnb, and travel aggregators.
We support providing both raw and structured data with various delivery methods.
Get the competitive advantage of hospitality and travel intelligence with scheduled data extractions, and receive your data right in your inbox.