Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset supplements the publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW’25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites; see concrete numbers below.
The following table lists the number of websites visited per month:
| Month | Number of websites |
|---|---|
| 2024-01 | 551'148 |
| 2024-02 | 792'921 |
| 2024-03 | 844'537 |
| 2024-04 | 802'169 |
| 2024-05 | 805'878 |
| 2024-06 | 809'518 |
| 2024-07 | 811'418 |
| 2024-08 | 813'534 |
| 2024-09 | 814'321 |
| 2024-10 | 817'586 |
| 2024-11 | 828'662 |
| 2024-12 | 827'101 |
The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), because a website may redirect (resulting in two websites being scraped) or may have to be retried.
To simplify access, we release the data in large CSVs: one file for policies and one for terms per month. These files contain all the metadata usable for the analysis. If your favourite CSV parser reports the same numbers as above, then our dataset is parsed correctly. We use ‘,’ as the separator, the first row is the header, and strings are in quotes.
Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication), and these might contain personal data, such as addresses of website authors that are maintained only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio. Presidio substitutes personal data with tokens. If your personal data has not been effectively anonymized from the database and you wish for it to be deleted, please contact us.
The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. This also means that you likely cannot process all the data in memory at once, which is why we split the data by month and into separate files for policies and terms.
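Because of this size, a month's file is best processed in chunks. Below is a minimal sketch in Python with pandas, assuming a hypothetical file name (`policies-2024-01.csv`); the format described above (comma separator, header row, quoted strings) matches pandas defaults.

```python
import pandas as pd

# Hypothetical file name; substitute the real policies file for the month.
PATH = "policies-2024-01.csv"

total_rows = 0
unique_sites = set()

# Default pandas settings match the stated format: ',' separator,
# first row as header, quoted strings.
for chunk in pd.read_csv(PATH, chunksize=100_000):
    total_rows += len(chunk)
    unique_sites.update(chunk["website_url"])

# Compare these counts against the monthly table above as a sanity check.
print(f"rows: {total_rows}, unique websites: {len(unique_sites)}")
```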
The files have the following names:
Both files contain the following metadata columns:
- website_month_id - identification of the crawled website
- job_id - one website can have multiple jobs in case of redirects (but most commonly has only one)
- website_index_status - network state of loading the index page, resolved via the Chrome DevTools Protocol. Possible values:
  - DNS_ERROR - domain cannot be resolved
  - OK - all fine
  - REDIRECT - domain redirects to somewhere else
  - TIMEOUT - the request timed out
  - BAD_CONTENT_TYPE - 415 Unsupported Media Type
  - HTTP_ERROR - 404 error
  - TCP_ERROR - error in the network connection
  - UNKNOWN_ERROR - unknown error
- website_lang - language of the index page, detected with the langdetect library
- website_url - the URL of the website sampled from the CrUX list (may contain subdomains, etc.). Use this as a unique identifier for connecting data between months.
- job_domain_status - indicates the status of loading the index page. Can be:
  - OK - all works well (at the moment, should be all entries)
  - BLACKLISTED - URL is on our list of blocked URLs
  - UNSAFE - website is not safe according to Google's Safe Browsing API
  - LOCATION_BLOCKED - country is in the list of blocked countries
- job_started_at - when the visit of the website started
- job_ended_at - when the visit of the website ended
- job_crux_popularity - JSON with all popularity ranks of the website this month
- job_index_redirect - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The index_redirect is then the job.id corresponding to the redirect target.
- job_num_starts - number of crawlers that started this job (counts restarts in case of an unsuccessful crawl; max is 3)
- job_from_static - whether this job was included in the static selection (see Sec. 3.3 of the paper)
- job_from_dynamic - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper). This is not exclusive with from_static; both can be true when the lists overlap.
- job_crawl_name - our name of the crawl, containing the year and month (e.g., 'regular-2024-12' for the regular crawl in Dec 2024)
- policy_url_id - ID of the URL of this policy
- policy_keyword_score - score (higher is better) according to the crawler's keyword list that the given document is a policy
- policy_ml_probability - probability assigned by the BERT model that the given document is a policy
- policy_consideration_basis - the basis on which we decided that this URL is a policy. The following three options are executed by the crawler in this order:
- policy_url - full URL of the policy
- policy_content_hash - used as an identifier; if the document remained the same between crawls, it won't create a new entry
- policy_content - contains the text of policies and terms extracted to Markdown using Mozilla's readability library
- policy_lang - language of the content, detected by fasttext

The terms columns are analogous to the policy columns; just substitute "policy" with "terms".
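To illustrate how the columns above can be used, here is a small sketch (assuming two monthly policy CSVs with hypothetical file names) that links the same websites across months via website_url and uses policy_content_hash to flag documents that changed between crawls:

```python
import pandas as pd

# Hypothetical file names for two consecutive monthly policy dumps.
jan = pd.read_csv("policies-2024-01.csv",
                  usecols=["website_url", "policy_url", "policy_content_hash"])
feb = pd.read_csv("policies-2024-02.csv",
                  usecols=["website_url", "policy_url", "policy_content_hash"])

# website_url is the stable identifier for connecting data between months.
merged = jan.merge(feb, on=["website_url", "policy_url"],
                   suffixes=("_jan", "_feb"))

# A differing policy_content_hash indicates the policy text changed
# between the two crawls.
changed = merged[merged["policy_content_hash_jan"]
                 != merged["policy_content_hash_feb"]]
print(f"{len(changed)} policies changed between January and February")
```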
Check this Google Docs for an updated version of this README.md.
This statistic shows the results of a survey conducted by Cint on the distribution of websites regularly visited in Sweden in 2017 and 2018. In 2018, ***** percent of respondents stated that they visit Google regularly.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This anonymized data set consists of one month's (October 2018) web tracking data of 2,148 German users. For each user, the data contains the anonymized URL of the webpage the user visited, the domain of the webpage, and the category of the domain (41 distinct categories). In total, these 2,148 users made 9,151,243 URL visits, spanning 49,918 unique domains. For each user in our data set, we have self-reported information (collected via a survey) about their gender and age.
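For orientation, a sketch of how one might reproduce the summary figures above (total URL visits, unique domains, number of users) with pandas; the file name and column names (user_id, domain) are assumptions made for illustration, so check the dataset's own documentation for the actual field names.

```python
import pandas as pd

# Hypothetical file name and column names; consult the dataset's codebook.
visits = pd.read_csv("web_tracking_october_2018.csv")

print("URL visits:", len(visits))                     # reported: 9,151,243
print("unique domains:", visits["domain"].nunique())  # reported: 49,918
print("users:", visits["user_id"].nunique())          # reported: 2,148

# Visits per user, e.g. to relate browsing volume to the survey demographics.
print(visits.groupby("user_id").size().describe())
```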
We acknowledge the support of Respondi AG, which provided the web tracking and survey data free of charge for research purposes, with special thanks to François Erner and Luc Kalaora at Respondi for their insights and help with data extraction.
The data set is analyzed in the following paper:
The code used to analyze the data is also available at https://github.com/gesiscss/web_tracking.
If you use data or code from this repository, please cite the paper above and the Zenodo link.
Users are advised that some domains in this data set may link to potentially questionable or inappropriate content. The domains have not been individually reviewed, as content verification was not the primary objective of this data set. Therefore, user discretion is strongly recommended when accessing or scraping any content from these domains.
This statistic shows the results of a survey conducted by Cint on the distribution of websites regularly visited in Norway in 2017 and 2018. In 2018, **** percent of respondents stated that they visit YouTube regularly.
Unlock the Power of Behavioural Data with GDPR-Compliant Clickstream Insights.
Swash clickstream data offers a comprehensive and GDPR-compliant dataset sourced from users worldwide, encompassing both desktop and mobile browsing behaviour. Here's an in-depth look at what sets us apart and how our data can benefit your organisation.
User-Centric Approach: Unlike traditional data collection methods, we take a user-centric approach by rewarding users for the data they willingly provide. This unique methodology ensures transparent data collection practices, encourages user participation, and establishes trust between data providers and consumers.
Wide Coverage and Varied Categories: Our clickstream data covers diverse categories, including search, shopping, and URL visits. Whether you are interested in understanding user preferences in e-commerce, analysing search behaviour across different industries, or tracking website visits, our data provides a rich and multi-dimensional view of user activities.
GDPR Compliance and Privacy: We prioritise data privacy and strictly adhere to GDPR guidelines. Our data collection methods are fully compliant, ensuring the protection of user identities and personal information. You can confidently leverage our clickstream data without compromising privacy or facing regulatory challenges.
Market Intelligence and Consumer Behaviour: Gain deep insights into market intelligence and consumer behaviour using our clickstream data. Understand trends, preferences, and user behaviour patterns by analysing the comprehensive user-level, time-stamped raw or processed data feed. Uncover valuable information about user journeys, search funnels, and paths to purchase to enhance your marketing strategies and drive business growth.
High-Frequency Updates and Consistency: We provide high-frequency updates and consistent user participation, offering both historical data and ongoing daily delivery. This ensures you have access to up-to-date insights and a continuous data feed for comprehensive analysis. Our reliable and consistent data empowers you to make accurate and timely decisions.
Custom Reporting and Analysis: We understand that every organisation has unique requirements. That's why we offer customisable reporting options, allowing you to tailor the analysis and reporting of clickstream data to your specific needs. Whether you need detailed metrics, visualisations, or in-depth analytics, we provide the flexibility to meet your reporting requirements.
Data Quality and Credibility: We take data quality seriously. Our data sourcing practices are designed to ensure responsible and reliable data collection. We implement rigorous data cleaning, validation, and verification processes, guaranteeing the accuracy and reliability of our clickstream data. You can confidently rely on our data to drive your decision-making processes.
Unlock the Potential of Your Web Traffic with Advanced Data Resolution
In the digital age, understanding and leveraging web traffic data is crucial for businesses aiming to thrive online. Our pioneering solution transforms anonymous website visits into valuable B2B and B2C contact data, offering unprecedented insights into your digital audience. By integrating our unique tag into your website, you unlock the capability to convert 25-50% of your anonymous traffic into actionable contact rows, directly deposited into an S3 bucket for your convenience. This process, known as "Web Traffic Data Resolution," is at the forefront of digital marketing and sales strategies, providing a competitive edge in understanding and engaging with your online visitors.
Comprehensive Web Traffic Data Resolution

Our product stands out by offering a robust solution for "Web Traffic Data Resolution," a process that demystifies the identities behind your website traffic. By deploying a simple tag on your site, our technology goes to work, analyzing visitor behavior and leveraging proprietary data matching techniques to reveal the individuals and businesses behind the clicks. This innovative approach not only enhances your data collection but does so with respect for privacy and compliance standards, ensuring that your business gains insights ethically and responsibly.
Deep Dive into Web Traffic Data

At the core of our solution is the sophisticated analysis of "Web Traffic Data." Our system meticulously collects and processes every interaction on your site, from page views to time spent on each section. This data, once anonymous and perhaps seen as abstract numbers, is transformed into a detailed ledger of potential leads and customer insights. By understanding who visits your site, their interests, and their contact information, your business is equipped to tailor marketing efforts, personalize customer experiences, and streamline sales processes like never before.
Benefits of Our Web Traffic Data Resolution Service

Enhanced Lead Generation: By converting anonymous visitors into identifiable contact data, our service significantly expands your pool of potential leads. This direct enhancement of your lead generation efforts can dramatically increase conversion rates and ROI on marketing campaigns.
Targeted Marketing Campaigns: Armed with detailed B2B and B2C contact data, your marketing team can create highly targeted and personalized campaigns. This precision in marketing not only improves engagement rates but also ensures that your messaging resonates with the intended audience.
Improved Customer Insights: Gaining a deeper understanding of your web traffic enables your business to refine customer personas and tailor offerings to meet market demands. These insights are invaluable for product development, customer service improvement, and strategic planning.
Competitive Advantage: In a digital landscape where understanding your audience can make or break your business, our Web Traffic Data Resolution service provides a significant competitive edge. By accessing detailed contact data that others in your industry may overlook, you position your business as a leader in customer engagement and data-driven strategies.
Seamless Integration and Accessibility: Our solution is designed for ease of use, requiring only the placement of a tag on your website to start gathering data. The contact rows generated are easily accessible in an S3 bucket, ensuring that you can integrate this data with your existing CRM systems and marketing tools without hassle.
How It Works: A Closer Look at the Process

Our Web Traffic Data Resolution process is streamlined and user-friendly, designed to integrate seamlessly with your existing website infrastructure:
Tag Deployment: Implement our unique tag on your website with simple instructions. This tag is lightweight and does not impact your site's loading speed or user experience.
Data Collection and Analysis: As visitors navigate your site, our system collects web traffic data in real-time, analyzing behavior patterns, engagement metrics, and more.
Resolution and Transformation: Using advanced data matching algorithms, we resolve the collected web traffic data into identifiable B2B and B2C contact information.
Data Delivery: The resolved contact data is then securely transferred to an S3 bucket, where it is organized and ready for your access. This process occurs daily, ensuring you have the most up-to-date information at your fingertips.
Integration and Action: With the resolved data now in your possession, your business can take immediate action. From refining marketing strategies to enhancing customer experiences, the possibilities are endless.
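As a rough illustration of the Data Delivery step above, here is a sketch (using boto3) of pulling a day's resolved contact files from the S3 bucket; the bucket name and key prefix are placeholders, since these are provisioned per customer.

```python
import boto3

# Placeholder bucket and prefix; these are provisioned for your account.
BUCKET = "example-resolved-traffic"
PREFIX = "daily/2024-05-01/"

s3 = boto3.client("s3")

# List the day's resolved contact files and download them locally,
# ready to be loaded into a CRM or analytics pipeline.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in resp.get("Contents", []):
    key = obj["Key"]
    s3.download_file(BUCKET, key, key.split("/")[-1])
    print("downloaded", key)
```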
Security and Privacy: Our Commitment

Understanding the sensitivity of web traffic data and contact information, our solution is built with security and privacy at its core. We adhere to strict data protection regulat...
Approximately **** percent of internet users surveyed in Peru said that they had accessed news websites and apps from February 20, 2020 to March 5, 2020. Once the first case of COVID-19 in the country was reported, on March 6, 2020, until March 21, 2020, more than ** percent of the respondents stated that they had visited online news platforms. Meanwhile, the share of interviewees who said they had visited shopping websites and apps decreased *** percentage points, from **** percent before the coronavirus outbreak in Peru to **** percent afterwards.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Citizen respondents rank how they want to interact with and consume government data. Survey responses are broken down along several dimensions, including Region, Education Level, Gender, and Household (HH) Income.
This dataset shows the Alexa Top 100 International Websites and provides metrics on the volume of traffic that these sites were able to handle. The Alexa Top 100 lists the 100 most visited websites in the world and measures various statistical information. I have looked up the headquarters, either through Alexa or a Whois lookup, to get a street address which I was then able to geocode. I was only able to successfully geocode 85 of the top 100 sites throughout the world. Source of data was Alexa.com, source URL: http://www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none. Data was from October 12, 2007. Alexa is updated daily, so to get more up-to-date information visit their site directly; they don't have maps, though.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Measuring and characterizing web page performance is a challenging task. When it comes to the mobile world, the highly varying technology characteristics coupled with the opaque network configuration make it even more difficult. Aiming at reproducibility, we present a large-scale measurement study of web page performance collected in eleven commercial mobile networks spanning four countries. We build a dataset of nearly two million web browsing sessions to shed light on the impact of different web protocols, browsers, and mobile technologies on web performance. We find that the impact of mobile broadband access is sizeable. For example, the median page load time using mobile broadband increases by a third compared to wired access. Mobility clearly stresses the system, with handover causing the most evident performance penalties. Contrariwise, our measurements show that the adoption of HTTP/2 and QUIC has practically negligible impact. Our work highlights the importance of large-scale measurements. Even with our controlled setup, the complexity of the mobile web ecosystem is challenging to untangle. For this, we are releasing the dataset as open data for validation and further research. Together with the dataset, we also release the scripts we used to produce the analysis presented in the paper. Please use the plot_all.sh script to generate the plots in the paper, using the separate scripts from the "scripts" archive.

Should you use any of these resources, please also make an attribution using the following reference (provided here in BibTeX format):

@inproceedings{rajiullah2019web,
  title={{Web Experience in Mobile Networks: Lessons from Two Million Page Visits}},
  author={Rajiullah, Mohammad and Lutu, Andra and Khatouni, Ali Safari and Fida, Mah-Rukh and Mellia, Marco and Brunstrom, Anna and Alay, Ozgu and Alfredsson, Stefan and Mancuso, Vincenzo},
  booktitle={The World Wide Web Conference},
  pages={1532--1543},
  year={2019},
  organization={ACM},
  address={San Francisco, CA, USA},
  keywords={Web Experience, HTTP2, QUIC, TCP, Mobile Broadband, Measurements}
}
OpenWeb Ninja's Google Images Data (Google SERP Data) API provides real-time image search capabilities for images sourced from all public sources on the web.
The API enables you to search and access more than 100 billion images from across the web, with advanced filtering capabilities as supported by Google Advanced Image Search. The API provides Google Images Data (Google SERP Data), including details such as image URL, title, size information, thumbnail, source information, and more data points. The API supports advanced filtering options such as file type, image color, usage rights, creation time, and more. In addition, any Advanced Google Search operators can be used with the API.
OpenWeb Ninja's Google Images Data & Google SERP Data API common use cases:
Creative Media Production: Enhance digital content with a vast array of real-time images, ensuring engaging and brand-aligned visuals for blogs, social media, and advertising.
AI Model Enhancement: Train and refine AI models with diverse, annotated images, improving object recognition and image classification accuracy.
Trend Analysis: Identify emerging market trends and consumer preferences through real-time visual data, enabling proactive business decisions.
Innovative Product Design: Inspire product innovation by exploring current design trends and competitor products, ensuring market-relevant offerings.
Advanced Search Optimization: Improve search engines and applications with enriched image datasets, providing users with accurate, relevant, and visually appealing search results.
OpenWeb Ninja's Annotated Imagery Data & Google SERP Data Stats & Capabilities:
100B+ Images: Access an extensive database of over 100 billion images.
Images Data from all Public Sources (Google SERP Data): Benefit from a comprehensive aggregation of image data from various public websites, ensuring a wide range of sources and perspectives.
Extensive Search and Filtering Capabilities: Utilize advanced search operators and filters to refine image searches by file type, color, usage rights, creation time, and more, making it easy to find exactly what you need.
Rich Data Points: Each image comes with more than 10 data points, including URL, title (annotation), size information, thumbnail, and source information, providing a detailed context for each image.
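Purely as an illustration of the kind of request involved, here is a sketch of querying an image-search endpoint with the filters described above; the endpoint URL, parameter names, and header are hypothetical placeholders, not the provider's actual API reference, so consult the official documentation before integrating.

```python
import requests

# Hypothetical endpoint and parameter names, shown only to illustrate the
# filtering options described above (file type, color, usage rights, time).
resp = requests.get(
    "https://api.example.com/google-image-search",
    params={
        "query": "electric bicycle",
        "file_type": "png",       # filter by file type
        "color": "blue",          # filter by image color
        "usage_rights": "reuse",  # filter by usage rights
        "time": "past_year",      # filter by creation time
    },
    headers={"X-API-Key": "YOUR_KEY"},  # placeholder credential
    timeout=30,
)

for image in resp.json().get("results", []):
    # Typical data points: image URL, title, size, thumbnail, source.
    print(image.get("title"), image.get("url"))
```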
This statistic shows the results of a survey conducted by Cint on the distribution of websites regularly visited in Poland from 2016 to 2018. In 2017, ***** percent of respondents stated that they visit Facebook regularly.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this cross-sectional study, we extracted the uniform resource locator (URL) of each National Abortion Federation member facility on May 6, 2022. We visited each unique URL using webXray (Timothy Libert), which detects third-party tracking. For each web page, we recorded data transfers to third-party domains. Transfers typically include a user’s IP (internet protocol) address and the web page being visited. We also recorded the presence of third-party cookies, data stored on a user’s computer that can facilitate tracking across multiple websites.
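The underlying check is conceptually simple: a request counts as third-party when its registered domain differs from the domain of the page being visited. Below is a minimal sketch of that comparison (not webXray itself, whose implementation is more involved), using the tldextract package and made-up URLs.

```python
import tldextract

def registered_domain(url: str) -> str:
    """Reduce a URL to its registered domain, e.g. sub.example.org -> example.org."""
    ext = tldextract.extract(url)
    return f"{ext.domain}.{ext.suffix}"

def third_party_requests(page_url: str, request_urls: list[str]) -> list[str]:
    """Return the requests whose registered domain differs from the page's."""
    page_domain = registered_domain(page_url)
    return [u for u in request_urls if registered_domain(u) != page_domain]

# Toy example: the analytics and social requests are flagged as third-party
# data transfers, while the site's own asset is not.
print(third_party_requests(
    "https://clinic.example.org/contact",
    ["https://clinic.example.org/style.css",
     "https://www.google-analytics.com/collect",
     "https://connect.facebook.net/sdk.js"],
))
```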
The Internet has become such an important part of our everyday life. It can be used to correspond with people across the world, a lot faster than sending a letter in the mail. The Internet has a wealth of information that is available to anybody just by searching for it. Sometimes you get more information than you ever wanted to know, and sometimes you just can't find the information. This paper only covers a small portion of the websites and their links that have geothermal information concerning reservoir engineering, enhanced geothermal systems and other aspects of geothermal energy. The websites below include sites located in the US, international websites, geothermal associations, and websites where you can access publications. Most of the websites listed below also have links to other websites for even more information.
Open Government Licence: http://reference.data.gov.uk/id/open-government-licence
The Local Directgov web service gives you direct access to the functions that drive the local government services on the Directgov website, so that you can use them in your own websites and other computer applications. They allow you to obtain service data directly from Local Directgov's database, looking up a specific service URL for a local authority, or general contact details if one cannot be found. Alternatively you can use different web service methods to request specific information.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.

For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (please find the code in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist with the websites that were the most popular in 2015. We manually compiled this list using data from DDMM, Alexa and our own research.

The dataset consists of 5 columns:
- the URL
- the type of website: we created a list of website types, and each website has been manually labelled with one category
- Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
- Nieuws-onderwerp: furthermore, each website under the category News was further subdivided by type of news website. For this we created our own list of news categories and manually coded each website
- Bron: for each website we noted which source we used to find it.

The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
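A small sketch of how the whitelist might be loaded and filtered with pandas, assuming it is distributed as a CSV; the file name and the name of the "type of website" column are placeholders, while the remaining column names follow the description above.

```python
import pandas as pd

# Placeholder file name; "type" stands in for the "type of website" column.
whitelist = pd.read_csv("newstracker_whitelist_2015.csv")

# Keep only news websites and break them down by regional focus.
news = whitelist[whitelist["type"] == "News"]
print(news["Nieuws-regio"].value_counts())   # International / National / Local

# Sources used to compile the list (DDMM, Alexa, own research).
print(whitelist["Bron"].value_counts())
```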
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
You have passwords for everything: your devices, your accounts (e.g. banking, social media, and email), and the websites you visit. By using passphrases or strong passwords you can protect your devices and information. Review the tips below to learn how you can create passphrases, strengthen your passwords, and avoid common mistakes that could put your information at risk.
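As a simple illustration of the passphrase approach, here is a sketch that builds a passphrase from randomly chosen words using Python's secrets module; the word list is a tiny stand-in, and a real passphrase should draw from a large dictionary so it has enough entropy.

```python
import secrets

# Tiny stand-in word list; use a large dictionary (thousands of words)
# in practice so the passphrase has enough entropy.
WORDS = ["maple", "harbour", "quartz", "lantern", "otter", "meadow",
         "violin", "glacier", "pepper", "sundial", "copper", "willow"]

def make_passphrase(num_words: int = 4, sep: str = "-") -> str:
    """Pick words with a cryptographically secure random generator."""
    return sep.join(secrets.choice(WORDS) for _ in range(num_words))

print(make_passphrase())  # e.g. "otter-quartz-meadow-lantern"
```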
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General data collected for the study "Analysis of the Quantitative Impact of Social Networks on Web Traffic of Cybermedia in the 27 Countries of the European Union".
Four research questions are posed: what percentage of the total web traffic generated by cybermedia in the European Union comes from social networks? Is said percentage higher or lower than that provided through direct traffic and through the use of search engines via SEO positioning? Which social networks have a greater impact? And is there any degree of relationship between the specific weight of social networks in the web traffic of a cybermedia and circumstances such as the average duration of the user's visit, the number of page views or the bounce rate understood in its formal aspect of not performing any kind of interaction on the visited page beyond reading its content?
To answer these questions, we first selected the cybermedia with the highest web traffic in the 27 countries that are currently part of the European Union after the United Kingdom left on December 31, 2020. In each nation we selected five media using a combination of the global web traffic metrics provided by the tools Alexa (https://www.alexa.com/), which ceased to be operational on May 1, 2022, and SimilarWeb (https://www.similarweb.com/). We have not used local metrics by country, since the results obtained with these first two tools were sufficiently significant and our objective is not to establish a ranking of cybermedia by nation but to examine the relevance of social networks in their web traffic.
In all cases, we selected cybermedia owned by a journalistic company, ruling out those belonging to telecommunications portals or service providers; some correspond to classic news companies (both newspapers and television stations) while others are digital natives, without this circumstance affecting the nature of the proposed research.
We then examined the web traffic data of these cybermedia. The period selected corresponds to the months of October, November and December 2021 and January, February and March 2022. We believe that this six-month stretch smooths out possible one-off variations in a single month, reinforcing the precision of the data obtained.
To obtain this data, we used the SimilarWeb tool, currently the most precise tool available for examining the web traffic of a portal, although it is limited to traffic coming from desktops and laptops and does not take into account traffic from mobile devices, which is currently impossible to determine with existing measurement tools on the market.
It includes:
- Web traffic general data: average visit duration, pages per visit and bounce rate
- Web traffic origin by country
- Percentage of traffic generated from social media over total web traffic
- Distribution of web traffic generated from social networks
- Comparison of web traffic generated from social networks with direct and search procedures
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The Local Directgov web service gives you direct access to the functions that drive the local government services on the Directgov website, so that you can use them in your own websites and other computer applications. They allow you to obtain service data directly from Local Directgov's database, looking up a specific service URL for a local authority, or general contact details if one cannot be found. Alternatively you can use different web service methods to request specific information.
DATAANT provides the ability to extract travel data from public sources like:
- Hotel websites
- Flight aggregators
- Homestay marketplaces
- Experience marketplaces
- Online Travel Agencies (OTAs)
and any open travel industry website you need.
Forecast travel trends with data from Booking.com, Airbnb, and travel aggregators.
We support providing both raw and structured data with various delivery methods.
Get the competitive advantage of hospitality and travel intelligence with scheduled data extractions, and receive your data right in your inbox.