100+ datasets found
  1. Multilingual Scraper of Privacy Policies and Terms of Service

    • zenodo.org
    bin, zip
    Updated Apr 24, 2025
    Cite
    David Bernhard; Luka Nenadic; Stefan Bechtold; Karel Kubicek (2025). Multilingual Scraper of Privacy Policies and Terms of Service [Dataset]. http://doi.org/10.5281/zenodo.14562039
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Bernhard; Luka Nenadic; Stefan Bechtold; Karel Kubicek
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Multilingual Scraper of Privacy Policies and Terms of Service: Scraped Documents of 2024

    This dataset supplements the publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW’25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites; see the concrete numbers below.

    The following table lists the number of websites visited per month:

    Month      Number of websites
    2024-01    551'148
    2024-02    792'921
    2024-03    844'537
    2024-04    802'169
    2024-05    805'878
    2024-06    809'518
    2024-07    811'418
    2024-08    813'534
    2024-09    814'321
    2024-10    817'586
    2024-11    828'662
    2024-12    827'101

    The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), because a website may redirect, resulting in two websites being scraped, or may have to be retried.

    To simplify access, we release the data in large CSVs. Namely, there is one file for policies and another for terms per month. All of these files contain all metadata that are usable for the analysis. If your favourite CSV parser reports the same numbers as above, then our dataset is correctly parsed. We use ‘,’ as the separator, the first row is the header, and strings are quoted.
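    The parsing rules above (comma separator, header row, quoted strings) map directly onto Python's standard csv module. The following is a minimal sketch, not the authors' tooling: it streams rows one at a time and counts distinct website_month_id values per job_crawl_name. Treating that count as the "number of websites" from the table is an assumption, and the helper name count_websites is hypothetical.

```python
import csv
from collections import defaultdict

# Document texts can be long; raise the per-field limit above the default.
csv.field_size_limit(2**31 - 1)


def count_websites(lines):
    """Count distinct website_month_id values per crawl name.

    `lines` is any iterable of CSV lines, e.g. a file opened with newline="".
    Assumption: one distinct website_month_id corresponds to one visited website.
    """
    seen = defaultdict(set)
    reader = csv.DictReader(lines, delimiter=",", quotechar='"')
    for row in reader:
        seen[row["job_crawl_name"]].add(row["website_month_id"])
    return {crawl: len(ids) for crawl, ids in seen.items()}
```

    Passing an open file handle (opened with newline="" and, presumably, UTF-8 encoding) keeps memory use bounded by the number of distinct IDs rather than by the file size.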

    Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication), and these might contain personal data, such as addresses of authors of websites that they maintain only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio, which substitutes personal data with tokens. If your personal data has not been effectively anonymized in the database and you wish for it to be deleted, please contact us.

    Preliminaries

    The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. This also means that you likely cannot process all the data at once in memory, so we split the data by month and into separate files for policies and terms.

    Files and structure

    The files have the following names:

    • 2024_policy.csv for policies
    • 2024_terms.csv for terms

    Shared metadata

    Both files contain the following metadata columns:

    • website_month_id - identification of crawled website
    • job_id - one website can have multiple jobs in case of redirects (but most commonly has only one)
    • website_index_status - network state of loading the index page. This is resolved by the Chrome DevTools Protocol.
      • DNS_ERROR - domain cannot be resolved
      • OK - all fine
      • REDIRECT - domain redirect to somewhere else
      • TIMEOUT - the request timed out
      • BAD_CONTENT_TYPE - 415 Unsupported Media Type
      • HTTP_ERROR - 404 error
      • TCP_ERROR - error in the network connection
      • UNKNOWN_ERROR - unknown error
    • website_lang - language of the index page, detected using the langdetect library
    • website_url - the URL of the website sampled from the CrUX list (may contain subdomains, etc). Use this as a unique identifier for connecting data between months.
    • job_domain_status - indicates the status of loading the index page. Can be:
      • OK - all works well (at the moment, should be all entries)
      • BLACKLISTED - URL is on our list of blocked URLs
      • UNSAFE - website is not safe according to the Safe Browsing API by Google
      • LOCATION_BLOCKED - country is in the list of blocked countries
    • job_started_at - when the visit of the website was started
    • job_ended_at - when the visit of the website was ended
    • job_crux_popularity - JSON with all popularity ranks of the website this month
    • job_index_redirect - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The index_redirect is then the job.id corresponding to the redirect target.
    • job_num_starts - number of crawlers that started this job (counts restarts in case of an unsuccessful crawl; the maximum is 3)
    • job_from_static - whether this job was included in the static selection (see Sec. 3.3 of the paper)
    • job_from_dynamic - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper) - this is not exclusive with from_static - both can be true when the lists overlap.
    • job_crawl_name - our name of the crawl, contains year and month (e.g., 'regular-2024-12' for regular crawls, in Dec 2024)
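
    The job_index_redirect column above links a redirecting job to the job that crawled the redirect target. As an illustration only, the sketch below follows such links until it reaches a job without a redirect. The in-memory mapping and the function name resolve_redirect are hypothetical, and treating an empty value as "no redirect" is an assumption about how the CSV encodes missing IDs.

```python
def resolve_redirect(redirects, job_id, max_hops=10):
    """Follow job_index_redirect links from job_id to the final job.

    `redirects` maps job_id -> job_index_redirect (None or "" when absent).
    `max_hops` guards against cycles; its value is an arbitrary choice.
    """
    current = job_id
    for _ in range(max_hops):
        target = redirects.get(current)
        if target in (None, ""):
            return current  # no further redirect recorded
        current = target
    raise ValueError(f"redirect chain from {job_id!r} exceeds {max_hops} hops")
```

    Grouping jobs by their resolved target is one way to attribute a single crawled document to every website that redirects to it.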

    Policy data

    • policy_url_id - ID of the URL this policy has
    • policy_keyword_score - score (higher is better), according to the crawler's keyword list, that the given document is a policy
    • policy_ml_probability - probability assigned by the BERT model that the given document is a policy
    • policy_consideration_basis - the basis on which we decided that this URL is a policy. The crawler tries the following three options in this order:
      1. 'keyword matching' - this policy was found using the crawler navigation (which is based on keywords)
      2. 'search' - this policy was found using a search engine
      3. 'path guessing' - this policy was found by using well-known URLs like example.com/policy
    • policy_url - full URL to the policy
    • policy_content_hash - used as identifier - if the document remained the same between crawls, it won't create a new entry
    • policy_content - contains the text of policies and terms extracted to Markdown using Mozilla's readability library
    • policy_lang - language of the content, detected by fastText
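
    One way to use the columns above is to keep only documents that the classifier considers likely policies and to drop repeated texts. The sketch below is illustrative: the 0.5 threshold is an arbitrary assumption rather than a value from the paper, the function name likely_unique_policies is hypothetical, and rows are assumed to be dicts such as those produced by csv.DictReader.

```python
def likely_unique_policies(rows, threshold=0.5):
    """Yield rows with policy_ml_probability >= threshold, one per content hash."""
    seen_hashes = set()
    for row in rows:
        try:
            probability = float(row["policy_ml_probability"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable probability
        content_hash = row.get("policy_content_hash")
        if probability < threshold or content_hash in seen_hashes:
            continue
        seen_hashes.add(content_hash)
        yield row
```

    Because it is a generator, the filter can be chained directly onto a streaming CSV reader without loading the file into memory.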

    Terms data

    Analogous to the policy data; just replace policy with terms in the column names.

    Updates

    Check this Google Doc for an updated version of this README.md.

  2. Distribution of websites regularly visited in Sweden 2017-2018

    • statista.com
    Updated Jul 7, 2025
    Cite
    Statista (2025). Distribution of websites regularly visited in Sweden 2017-2018 [Dataset]. https://www.statista.com/statistics/570094/distribution-of-websites-visited-regularly-in-sweden/
    Explore at:
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Sweden
    Description

    This statistic shows the results of a survey conducted by Cint on the distribution of websites regularly visited in Sweden in 2017 and 2018. In 2018, ***** percent of respondents stated that they visit Google regularly.

  3. A web tracking data set of online browsing behavior of 2,148 users

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, txt +1
    Updated Oct 9, 2025
    + more versions
    Cite
    Juhi Kulshrestha; Marcos Oliveira; Orkut Karacalik; Denis Bonnay; Claudia Wagner (2025). A web tracking data set of online browsing behavior of 2,148 users [Dataset]. http://doi.org/10.5281/zenodo.4757574
    Explore at:
    Available download formats: zip, txt, application/gzip
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juhi Kulshrestha; Marcos Oliveira; Orkut Karacalik; Denis Bonnay; Claudia Wagner
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This anonymized data set consists of one month's (October 2018) web tracking data of 2,148 German users. For each user, the data contains the anonymized URLs of the webpages the user visited, the domain of each webpage, and the category of the domain, drawn from 41 distinct categories. In total, these 2,148 users made 9,151,243 URL visits, spanning 49,918 unique domains. For each user in our data set, we have self-reported information (collected via a survey) about their gender and age.

    We acknowledge the support of Respondi AG, which provided the web tracking and survey data free of charge for research purposes, with special thanks to François Erner and Luc Kalaora at Respondi for their insights and help with data extraction.

    The data set is analyzed in the following paper:

    • Kulshrestha, J., Oliveira, M., Karacalik, O., Bonnay, D., Wagner, C. "Web Routineness and Limits of Predictability: Investigating Demographic and Behavioral Differences Using Web Tracking Data." Proceedings of the International AAAI Conference on Web and Social Media. 2021. https://arxiv.org/abs/2012.15112.

    The code used to analyze the data is also available at https://github.com/gesiscss/web_tracking.

    If you use data or code from this repository, please cite the paper above and the Zenodo link.

    Users are advised that some domains in this data set may link to potentially questionable or inappropriate content. The domains have not been individually reviewed, as content verification was not the primary objective of this data set. Therefore, user discretion is strongly recommended when accessing or scraping any content from these domains.

  4. Distribution of websites regularly visited in Norway 2017-2018

    • statista.com
    Updated Jul 11, 2025
    Cite
    Statista (2025). Distribution of websites regularly visited in Norway 2017-2018 [Dataset]. https://www.statista.com/statistics/570058/distribution-of-websites-visited-regularly-in-norway/
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Norway
    Description

    This statistic shows the results of a survey conducted by Cint on the distribution of websites regularly visited in Norway in 2017 and 2018. In 2018, **** percent of respondents stated that they visit YouTube regularly.

  5. Swash Web Browsing Clickstream Data - 1.5M Worldwide Users - GDPR Compliant

    • datarade.ai
    .csv, .xls
    Updated Jun 27, 2023
    + more versions
    Cite
    Swash (2023). Swash Web Browsing Clickstream Data - 1.5M Worldwide Users - GDPR Compliant [Dataset]. https://datarade.ai/data-products/swash-blockchain-bitcoin-and-web3-enthusiasts-swash
    Explore at:
    Available download formats: .csv, .xls
    Dataset updated
    Jun 27, 2023
    Dataset authored and provided by
    Swash
    Area covered
    Saint Vincent and the Grenadines, Latvia, Jordan, Belarus, Jamaica, Uzbekistan, Monaco, Liechtenstein, Russian Federation, India
    Description

    Unlock the Power of Behavioural Data with GDPR-Compliant Clickstream Insights.

    Swash clickstream data offers a comprehensive and GDPR-compliant dataset sourced from users worldwide, encompassing both desktop and mobile browsing behaviour. Here's an in-depth look at what sets us apart and how our data can benefit your organisation.

    User-Centric Approach: Unlike traditional data collection methods, we take a user-centric approach by rewarding users for the data they willingly provide. This unique methodology ensures transparent data collection practices, encourages user participation, and establishes trust between data providers and consumers.

    Wide Coverage and Varied Categories: Our clickstream data covers diverse categories, including search, shopping, and URL visits. Whether you are interested in understanding user preferences in e-commerce, analysing search behaviour across different industries, or tracking website visits, our data provides a rich and multi-dimensional view of user activities.

    GDPR Compliance and Privacy: We prioritise data privacy and strictly adhere to GDPR guidelines. Our data collection methods are fully compliant, ensuring the protection of user identities and personal information. You can confidently leverage our clickstream data without compromising privacy or facing regulatory challenges.

    Market Intelligence and Consumer Behaviour: Gain deep insights into market intelligence and consumer behaviour using our clickstream data. Understand trends, preferences, and user behaviour patterns by analysing the comprehensive user-level, time-stamped raw or processed data feed. Uncover valuable information about user journeys, search funnels, and paths to purchase to enhance your marketing strategies and drive business growth.

    High-Frequency Updates and Consistency: We provide high-frequency updates and consistent user participation, offering both historical data and ongoing daily delivery. This ensures you have access to up-to-date insights and a continuous data feed for comprehensive analysis. Our reliable and consistent data empowers you to make accurate and timely decisions.

    Custom Reporting and Analysis: We understand that every organisation has unique requirements. That's why we offer customisable reporting options, allowing you to tailor the analysis and reporting of clickstream data to your specific needs. Whether you need detailed metrics, visualisations, or in-depth analytics, we provide the flexibility to meet your reporting requirements.

    Data Quality and Credibility: We take data quality seriously. Our data sourcing practices are designed to ensure responsible and reliable data collection. We implement rigorous data cleaning, validation, and verification processes, guaranteeing the accuracy and reliability of our clickstream data. You can confidently rely on our data to drive your decision-making processes.

  6. Web Traffic Data | 500M+ US Web Traffic Data Resolution | B2B and B2C...

    • datarade.ai
    .csv, .xls
    Updated Feb 24, 2025
    Cite
    Allforce (2025). Web Traffic Data | 500M+ US Web Traffic Data Resolution | B2B and B2C Website Visitor Identity Resolution [Dataset]. https://datarade.ai/data-products/traffic-continuum-from-solution-publishing-500m-us-web-traf-solution-publishing
    Explore at:
    Available download formats: .csv, .xls
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Allforce
    Area covered
    United States of America
    Description

    Unlock the Potential of Your Web Traffic with Advanced Data Resolution

    In the digital age, understanding and leveraging web traffic data is crucial for businesses aiming to thrive online. Our pioneering solution transforms anonymous website visits into valuable B2B and B2C contact data, offering unprecedented insights into your digital audience. By integrating our unique tag into your website, you unlock the capability to convert 25-50% of your anonymous traffic into actionable contact rows, directly deposited into an S3 bucket for your convenience. This process, known as "Web Traffic Data Resolution," is at the forefront of digital marketing and sales strategies, providing a competitive edge in understanding and engaging with your online visitors.

    Comprehensive Web Traffic Data Resolution

    Our product stands out by offering a robust solution for "Web Traffic Data Resolution," a process that demystifies the identities behind your website traffic. By deploying a simple tag on your site, our technology goes to work, analyzing visitor behavior and leveraging proprietary data matching techniques to reveal the individuals and businesses behind the clicks. This innovative approach not only enhances your data collection but does so with respect for privacy and compliance standards, ensuring that your business gains insights ethically and responsibly.

    Deep Dive into Web Traffic Data

    At the core of our solution is the sophisticated analysis of "Web Traffic Data." Our system meticulously collects and processes every interaction on your site, from page views to time spent on each section. This data, once anonymous and perhaps seen as abstract numbers, is transformed into a detailed ledger of potential leads and customer insights. By understanding who visits your site, their interests, and their contact information, your business is equipped to tailor marketing efforts, personalize customer experiences, and streamline sales processes like never before.

    Benefits of Our Web Traffic Data Resolution Service

    Enhanced Lead Generation: By converting anonymous visitors into identifiable contact data, our service significantly expands your pool of potential leads. This direct enhancement of your lead generation efforts can dramatically increase conversion rates and ROI on marketing campaigns.

    Targeted Marketing Campaigns: Armed with detailed B2B and B2C contact data, your marketing team can create highly targeted and personalized campaigns. This precision in marketing not only improves engagement rates but also ensures that your messaging resonates with the intended audience.

    Improved Customer Insights: Gaining a deeper understanding of your web traffic enables your business to refine customer personas and tailor offerings to meet market demands. These insights are invaluable for product development, customer service improvement, and strategic planning.

    Competitive Advantage: In a digital landscape where understanding your audience can make or break your business, our Web Traffic Data Resolution service provides a significant competitive edge. By accessing detailed contact data that others in your industry may overlook, you position your business as a leader in customer engagement and data-driven strategies.

    Seamless Integration and Accessibility: Our solution is designed for ease of use, requiring only the placement of a tag on your website to start gathering data. The contact rows generated are easily accessible in an S3 bucket, ensuring that you can integrate this data with your existing CRM systems and marketing tools without hassle.

    How It Works: A Closer Look at the Process

    Our Web Traffic Data Resolution process is streamlined and user-friendly, designed to integrate seamlessly with your existing website infrastructure:

    Tag Deployment: Implement our unique tag on your website with simple instructions. This tag is lightweight and does not impact your site's loading speed or user experience.

    Data Collection and Analysis: As visitors navigate your site, our system collects web traffic data in real-time, analyzing behavior patterns, engagement metrics, and more.

    Resolution and Transformation: Using advanced data matching algorithms, we resolve the collected web traffic data into identifiable B2B and B2C contact information.

    Data Delivery: The resolved contact data is then securely transferred to an S3 bucket, where it is organized and ready for your access. This process occurs daily, ensuring you have the most up-to-date information at your fingertips.

    Integration and Action: With the resolved data now in your possession, your business can take immediate action. From refining marketing strategies to enhancing customer experiences, the possibilities are endless.

    Security and Privacy: Our Commitment

    Understanding the sensitivity of web traffic data and contact information, our solution is built with security and privacy at its core. We adhere to strict data protection regulat...

  7. Peru: change in visits to websites due to COVID-19 2020, by category

    • statista.com
    Updated Jul 8, 2025
    + more versions
    Cite
    Statista (2025). Peru: change in visits to websites due to COVID-19 2020, by category [Dataset]. https://www.statista.com/statistics/1108733/peru-website-visitors-change/
    Explore at:
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Feb 20, 2020 - Mar 21, 2020
    Area covered
    Peru
    Description

    Approximately **** percent of internet users surveyed in Peru said that they had accessed news websites and apps from February 20, 2020 to March 5, 2020. Once the first case of COVID-19 in the country was reported, on March 6, 2020, until March 21, 2020, more than ** percent of the respondents stated that they had visited online news platforms. Meanwhile, the share of interviewees who said they had visited shopping websites and apps decreased *** percentage points, from **** percent before the coronavirus outbreak in Peru to **** percent afterwards.

  8. How Citizens Prefer to Access Data on Government Websites (Detail)

    • benchmarkstudy.socrata.com
    csv, xlsx, xml
    Updated Aug 21, 2011
    + more versions
    Cite
    Socrata Open Government Data Benchmark Study (2011). How Citizens Prefer to Access Data on Government Websites (Detail) [Dataset]. https://benchmarkstudy.socrata.com/Public-Survey/How-Citizens-Prefer-to-Access-Data-on-Government-W/xkgk-r22k
    Explore at:
    Available download formats: xlsx, csv, xml
    Dataset updated
    Aug 21, 2011
    Dataset provided by
    Socrata (http://www.blist.com/)
    data.gov.in (http://data.gov.in/)
    Authors
    Socrata Open Government Data Benchmark Study
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Citizen respondents rank how they want to interact with and consume government data. Survey responses are broken down along several dimensions, including Region, Education Level, Gender, and Household (HH) Income.

  9. Alexa, International Top 100 Websites, Global, 10.12.2007

    • geocommons.com
    Updated Apr 29, 2008
    Cite
    Alexa (2008). Alexa, International Top 100 Websites, Global, 10.12.2007 [Dataset]. http://geocommons.com/search.html
    Explore at:
    Dataset updated
    Apr 29, 2008
    Dataset provided by
    data
    Alexa
    Description

    This dataset shows the Alexa Top 100 International Websites and provides metrics on the volume of traffic that these sites were able to handle. The Alexa Top 100 lists the 100 most visited websites in the world and measures various statistical information. I looked up the headquarters of each site, either through Alexa or a Whois lookup, to get a street address, which I was then able to geocode. I was only able to successfully geocode 85 of the top 100 sites throughout the world. Source of data: Alexa.com (http://www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none). The data is from October 12, 2007. Alexa is updated daily, so for more up-to-date information visit their site directly; they don't have maps, though.

  10. Data from: Web Experience in Mobile Networks: Lessons from Two Million Page...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Web Experience in Mobile Networks: Lessons from Two Million Page Visits [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-2574157?locale=nl
    Explore at:
    Available download formats: unknown (896)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Measuring and characterizing web page performance is a challenging task. When it comes to the mobile world, the highly varying technology characteristics coupled with the opaque network configuration make it even more difficult. Aiming at reproducibility, we present a large-scale measurement study of web page performance collected in eleven commercial mobile networks spanning four countries. We build a dataset of nearly two million web browsing sessions to shed light on the impact of different web protocols, browsers, and mobile technologies on web performance. We find that the impact of mobile broadband access is sizeable. For example, the median page load time using mobile broadband increases by a third compared to wired access. Mobility clearly stresses the system, with handover causing the most evident performance penalties. Contrariwise, our measurements show that the adoption of HTTP/2 and QUIC has practically negligible impact. Our work highlights the importance of large-scale measurements. Even with our controlled setup, the complexity of the mobile web ecosystem is challenging to untangle. For this, we are releasing the dataset as open data for validation and further research. Together with the datasets we collected, we also release the scripts we used to produce the analysis presented in the paper. Please use the plot_all.sh script to generate the plots in the paper, using the separate scripts from the "scripts" archive.
    Should you use any of these resources, please also make an attribution using the following reference (provided here in BibTeX format):

    @inproceedings{rajiullah2019web,
      title={{Web Experience in Mobile Networks: Lessons from Two Million Page Visits}},
      author={Rajiullah, Mohammad and Lutu, Andra and Khatouni, Ali Safari and Fida, Mah-Rukh and Mellia, Marco and Brunstrom, Anna and Alay, Ozgu and Alfredsson, Stefan and Mancuso, Vincenzo},
      booktitle={The World Wide Web Conference},
      pages={1532--1543},
      year={2019},
      organization={ACM},
      address = {San Francisco, CA, USA},
      keywords = {Web Experience, HTTP2, QUIC, TCP, Mobile Broadband, Measurements}
    }

  11. Google SERP Data, Web Search Data, Google Images Data | Real-Time API

    • datarade.ai
    .json, .csv
    Updated May 17, 2024
    Cite
    OpenWeb Ninja (2024). Google SERP Data, Web Search Data, Google Images Data | Real-Time API [Dataset]. https://datarade.ai/data-products/openweb-ninja-google-data-google-image-data-google-serp-d-openweb-ninja
    Explore at:
    Available download formats: .json, .csv
    Dataset updated
    May 17, 2024
    Dataset authored and provided by
    OpenWeb Ninja
    Area covered
    Uganda, Burundi, Panama, South Georgia and the South Sandwich Islands, Ireland, Barbados, Tokelau, Grenada, Virgin Islands (U.S.), Uruguay
    Description

    OpenWeb Ninja's Google Images Data (Google SERP Data) API provides real-time image search capabilities for images sourced from all public sources on the web.

    The API enables you to search and access more than 100 billion images from across the web including advanced filtering capabilities as supported by Google Advanced Image Search. The API provides Google Images Data (Google SERP Data) including details such as image URL, title, size information, thumbnail, source information, and more data points. The API supports advanced filtering and options such as file type, image color, usage rights, creation time, and more. In addition, any Advanced Google Search operators can be used with the API.

    OpenWeb Ninja's Google Images Data & Google SERP Data API common use cases:

    • Creative Media Production: Enhance digital content with a vast array of real-time images, ensuring engaging and brand-aligned visuals for blogs, social media, and advertising.

    • AI Model Enhancement: Train and refine AI models with diverse, annotated images, improving object recognition and image classification accuracy.

    • Trend Analysis: Identify emerging market trends and consumer preferences through real-time visual data, enabling proactive business decisions.

    • Innovative Product Design: Inspire product innovation by exploring current design trends and competitor products, ensuring market-relevant offerings.

    • Advanced Search Optimization: Improve search engines and applications with enriched image datasets, providing users with accurate, relevant, and visually appealing search results.

    OpenWeb Ninja's Annotated Imagery Data & Google SERP Data Stats & Capabilities:

    • 100B+ Images: Access an extensive database of over 100 billion images.

    • Images Data from all Public Sources (Google SERP Data): Benefit from a comprehensive aggregation of image data from various public websites, ensuring a wide range of sources and perspectives.

    • Extensive Search and Filtering Capabilities: Utilize advanced search operators and filters to refine image searches by file type, color, usage rights, creation time, and more, making it easy to find exactly what you need.

    • Rich Data Points: Each image comes with more than 10 data points, including URL, title (annotation), size information, thumbnail, and source information, providing a detailed context for each image.

  12. Distribution of websites regularly visited in Poland 2016-2018

    • statista.com
    Updated Jul 7, 2025
    Cite
    Statista (2025). Distribution of websites regularly visited in Poland 2016-2018 [Dataset]. https://www.statista.com/statistics/570073/distribution-of-websites-visited-regularly-in-poland/
    Explore at:
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Poland
    Description

    This statistic shows the results of a survey conducted by Cint on the distribution of websites regularly visited in Poland from 2016 to 2018. In 2017, ***** percent of respondents stated that they visit Facebook regularly.

  13. Replication Data for "Prevalence of Third-Party Tracking on Abortion Clinic...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ari B Friedman (2023). Replication Data for "Prevalence of Third-Party Tracking on Abortion Clinic Web Pages" [Dataset]. http://doi.org/10.6084/m9.figshare.21437970.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ari B Friedman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this cross-sectional study, we extracted the uniform resource locator (URL) of each National Abortion Federation member facility on May 6, 2022. We visited each unique URL using webXray (Timothy Libert), which detects third-party tracking. For each web page, we recorded data transfers to third-party domains. Transfers typically include a user’s IP (internet protocol) address and the web page being visited. We also recorded the presence of third-party cookies, data stored on a user’s computer that can facilitate tracking across multiple websites.

  14. Data from: Geothermal Websites

    • data.wu.ac.at
    pdf
    Updated Dec 4, 2017
    (2017). Geothermal Websites [Dataset]. https://data.wu.ac.at/schema/geothermaldata_org/MDZmNDdkZjUtNGUxNC00MzY3LWIwOWUtOWQ3YTNiMjIxNjBh
    Available download formats: pdf
    Dataset updated
    Dec 4, 2017
    Description

    The Internet has become an important part of our everyday life. It can be used to correspond with people across the world much faster than sending a letter in the mail. The Internet has a wealth of information that is available to anybody just by searching for it. Sometimes you get more information than you ever wanted to know, and sometimes you just can't find the information. This paper covers only a small portion of the websites and their links that have geothermal information concerning reservoir engineering, enhanced geothermal systems and other aspects of geothermal. The websites below include sites located in the US, international websites, geothermal associations, and websites where you can access publications. Most of the websites listed below also have links to other websites for even more information.

  15. Local Directgov web service

    • data.europa.eu
    • data.wu.ac.at
    xml
    Updated Aug 7, 2011
    Ministry of Housing, Communities and Local Government (2011). Local Directgov web service [Dataset]. https://data.europa.eu/data/datasets/local_directgov_web_service?locale=da
    Available download formats: xml
    Dataset updated
    Aug 7, 2011
    Dataset authored and provided by
    Ministry of Housing, Communities and Local Government
    License

    http://reference.data.gov.uk/id/open-government-licence

    Description

    The Local Directgov web service gives you direct access to the functions that drive the local government services on the Directgov website, so that you can use them in your own websites and other computer applications. They allow you to obtain service data directly from Local Directgov's database, looking up a specific service URL for a local authority, or general contact details if one cannot be found. Alternatively you can use different web service methods to request specific information.

  16. Most popular websites in the Netherlands 2015

    • ssh.datastations.nl
    csv, tsv, zip
    Updated May 9, 2017
    M. Kleppe; H. Bijleveld (2017). Most popular websites in the Netherlands 2015 [Dataset]. http://doi.org/10.17026/DANS-X6H-6QQT
    Available download formats: zip(15855), csv(138294), tsv(176359)
    Dataset updated
    May 9, 2017
    Dataset provided by
    DANS Data Station Social Sciences and Humanities
    Authors
    M. Kleppe; H. Bijleveld
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Netherlands
    Dataset funded by
    NWO
    Description

    This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as whitelist for the Newstracker Research project in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.

    For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (please find the code in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist with websites that were the most popular websites in 2015. We manually compiled this list by using data of DDMM, Alexa and own research.

    The dataset consists of 5 columns:

    • the URL
    • the type of website: We created a list of types of websites and each website has been manually labeled with 1 category
    • Nieuws-regio: When the category was 'News', we subdivided these websites in the regional focus: International, National or Local
    • Nieuws-onderwerp: Furthermore, each website under the category News was further subdivided in type of news website. For this we created an own list of news categories and manually coded each website
    • Bron: For each website we noted which source we used to find this website.

    The full description of the research design of the Newstracker including the set-up of this whitelist is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.

  17. Best Practices for Passphrases and Passwords (ITSAP.30.032) - Catalogue -...

    • data.urbandatacentre.ca
    Updated Sep 30, 2024
    (2024). Best Practices for Passphrases and Passwords (ITSAP.30.032) - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/gov-canada-f57e1d6d-6a60-456c-ba5a-a535d1252798
    Dataset updated
    Sep 30, 2024
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Area covered
    Canada
    Description

    You have passwords for everything: your devices, your accounts (e.g. banking, social media, and email), and the websites you visit. By using passphrases or strong passwords you can protect your devices and information. Review the tips below to learn how you can create passphrases, strengthen your passwords, and avoid common mistakes that could put your information at risk.

  18. Data from: Analysis of the Quantitative Impact of Social Networks General...

    • figshare.com
    • produccioncientifica.ucm.es
    doc
    Updated Oct 14, 2022
    David Parra; Santiago Martínez Arias; Sergio Mena Muñoz (2022). Analysis of the Quantitative Impact of Social Networks General Data.doc [Dataset]. http://doi.org/10.6084/m9.figshare.21329421.v1
    Available download formats: doc
    Dataset updated
    Oct 14, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    David Parra; Santiago Martínez Arias; Sergio Mena Muñoz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General data collected for the study "Analysis of the Quantitative Impact of Social Networks on Web Traffic of Cybermedia in the 27 Countries of the European Union". Four research questions are posed: what percentage of the total web traffic generated by cybermedia in the European Union comes from social networks? Is said percentage higher or lower than that provided through direct traffic and through the use of search engines via SEO positioning? Which social networks have a greater impact? And is there any degree of relationship between the specific weight of social networks in the web traffic of a cybermedia and circumstances such as the average duration of the user's visit, the number of page views or the bounce rate understood in its formal aspect of not performing any kind of interaction on the visited page beyond reading its content? To answer these questions, we first selected the cybermedia with the highest web traffic in the 27 countries that are currently part of the European Union after the United Kingdom left on December 31, 2020. In each nation we have selected five media using a combination of the global web traffic metrics provided by the tools Alexa (https://www.alexa.com/), which ceased to be operational on May 1, 2022, and SimilarWeb (https://www.similarweb.com/). We have not used local metrics by country since the results obtained with these first two tools were sufficiently significant and our objective is not to establish a ranking of cybermedia by nation but to examine the relevance of social networks in their web traffic. In all cases, cybermedia owned by a journalistic company have been selected, ruling out those belonging to telecommunications portals or service providers; in some cases they correspond to classic information companies (both newspapers and televisions) while in others they refer to digital natives, without this circumstance affecting the nature of the research proposed.
    Next, we examined the web traffic data of these cybermedia. The period corresponding to the months of October, November and December 2021 and January, February and March 2022 has been selected. We believe that this six-month stretch smooths out possible one-off variations in a single month, reinforcing the precision of the data obtained. To secure this data, we have used the SimilarWeb tool, currently the most precise tool that exists when examining the web traffic of a portal, although it is limited to that coming from desktops and laptops, without taking into account those that come from mobile devices, currently impossible to determine with existing measurement tools on the market. The dataset includes:

    • Web traffic general data: average visit duration, pages per visit and bounce rate
    • Web traffic origin by country
    • Percentage of traffic generated from social media over total web traffic
    • Distribution of web traffic generated from social networks
    • Comparison of web traffic generated from social networks with direct and search procedures

  19. Local Directgov web service - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Sep 27, 2011
    ckan.publishing.service.gov.uk (2011). Local Directgov web service - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/local_directgov_web_service
    Dataset updated
    Sep 27, 2011
    Dataset provided by
    CKAN (https://ckan.org/)
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    The Local Directgov web service gives you direct access to the functions that drive the local government services on the Directgov website, so that you can use them in your own websites and other computer applications. They allow you to obtain service data directly from Local Directgov's database, looking up a specific service URL for a local authority, or general contact details if one cannot be found. Alternatively you can use different web service methods to request specific information.

  20. DATAANT | Travel Data | Dataset, API | Booking and Pricing Data: Hotel...

    • datarade.ai
    Updated Mar 1, 2023
    Dataant (2023). DATAANT | Travel Data | Dataset, API | Booking and Pricing Data: Hotel Websites, Flight Aggregators and Rental Aggregators | Global Coverage [Dataset]. https://datarade.ai/data-products/dataant-travel-data-dataset-api-booking-and-pricing-da-dataant
    Available download formats: .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Mar 1, 2023
    Dataset authored and provided by
    Dataant
    Area covered
    Svalbard and Jan Mayen, Bulgaria, Honduras, Kyrgyzstan, Greece, Norfolk Island, Dominican Republic, Vietnam, Luxembourg, Saint Barthélemy
    Description

    DATAANT provides the ability to extract travel data from public sources like:

    • Hotel websites
    • Flight aggregators
    • Homestay marketplaces
    • Experience marketplaces
    • Online Travel Agencies (OTA)

    and any open travel industry website you need.

    Forecast travel trends with Booking.com, Airbnb, and travel aggregators data.

    We support providing both raw and structured data with various delivery methods.

    Get the competitive advantage of hospitality and travel Intelligence by scheduled data extractions and receive your data right to your inbox.

David Bernhard; Luka Nenadic; Stefan Bechtold; Karel Kubicek (2025). Multilingual Scraper of Privacy Policies and Terms of Service [Dataset]. http://doi.org/10.5281/zenodo.14562039

Multilingual Scraper of Privacy Policies and Terms of Service

Available download formats: zip, bin
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
David Bernhard; Luka Nenadic; Stefan Bechtold; Karel Kubicek
License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically

Description

Multilingual Scraper of Privacy Policies and Terms of Service: Scraped Documents of 2024

This dataset supplements the publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW’25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites; the concrete numbers are listed below.

The following table lists the number of websites visited per month:

Month      Number of websites
2024-01    551'148
2024-02    792'921
2024-03    844'537
2024-04    802'169
2024-05    805'878
2024-06    809'518
2024-07    811'418
2024-08    813'534
2024-09    814'321
2024-10    817'586
2024-11    828'662
2024-12    827'101

The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), because a website may redirect, resulting in two scraped websites, or may have to be retried.

To simplify access, we release the data in large CSV files. Namely, there is one file for policies and another for terms per month. All of these files contain all metadata that are usable for the analysis. If your favourite CSV parser reports the same numbers as above, then our dataset is correctly parsed. We use ‘,’ as the separator, the first row is the header, and strings are in quotes.

Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication). Such documents might contain personal data, such as addresses of authors of websites that they maintain only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio, which substitutes personal data with tokens. If your personal data has not been effectively anonymized in the database and you wish for it to be deleted, please contact us.

Preliminaries

The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. This also means that you likely cannot process all the data at once in memory, so we split the data by month and into separate files for policies and terms.
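
Because the files are unlikely to fit in memory, one option is to stream them row by row. The following is a minimal sketch using only Python's standard library. It assumes the separator, header row, and quoting described in this README; the helper name count_rows_by_status is hypothetical, and the sample values are invented. The field size limit is raised because long documents may exceed the csv module's default per-field limit.

```python
import csv
import io
import sys
from collections import Counter


def count_rows_by_status(handle):
    """Stream a CSV file object and count rows per website_index_status."""
    # Long documents in the content columns may exceed the csv module's
    # default per-field limit, so raise it before reading.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    reader = csv.DictReader(handle, delimiter=",", quotechar='"')
    counts = Counter()
    for row in reader:  # only one row is held in memory at a time
        counts[row["website_index_status"]] += 1
    return counts


# Tiny in-memory sample standing in for a real file (values are invented).
sample = io.StringIO(
    'website_month_id,website_index_status,policy_content\n'
    '1,"OK","Line one\nline two, with a comma"\n'
    '2,"TIMEOUT",""\n'
    '3,"OK","Another document"\n'
)
print(count_rows_by_status(sample))
```

For a real file, the handle would come from something like open("2024_policy.csv", newline="", encoding="utf-8"); the UTF-8 encoding is an assumption, as the README does not state it.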

Files and structure

The files have the following names:

  • 2024_policy.csv for policies
  • 2024_terms.csv for terms

Shared metadata

Both files contain the following metadata columns:

  • website_month_id - identification of crawled website
  • job_id - one website can have multiple jobs in case of redirects (but most commonly has only one)
  • website_index_status - network state of loading the index page. This is resolved by the Chrome DevTools Protocol.
    • DNS_ERROR - domain cannot be resolved
    • OK - all fine
    • REDIRECT - domain redirects elsewhere
    • TIMEOUT - the request timed out
    • BAD_CONTENT_TYPE - 415 Unsupported Media Type
    • HTTP_ERROR - 404 error
    • TCP_ERROR - error in the network connection
    • UNKNOWN_ERROR - unknown error
  • website_lang - language of index page detected based on langdetect library
  • website_url - the URL of the website sampled from the CrUX list (may contain subdomains, etc). Use this as a unique identifier for connecting data between months.
  • job_domain_status - indicates the status of loading the index page. Can be:
    • OK - all works well (at the moment, should be all entries)
    • BLACKLISTED - URL is on our list of blocked URLs
    • UNSAFE - website is not safe according to the Safe Browsing API by Google
    • LOCATION_BLOCKED - country is in the list of blocked countries
  • job_started_at - when the visit of the website was started
  • job_ended_at - when the visit of the website was ended
  • job_crux_popularity - JSON with all popularity ranks of the website this month
  • job_index_redirect - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The index_redirect is then the job.id corresponding to the redirect target.
  • job_num_starts - number of crawlers that started this job (counts restarts in case of an unsuccessful crawl, max is 3)
  • job_from_static - whether this job was included in the static selection (see Sec. 3.3 of the paper)
  • job_from_dynamic - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper) - this is not exclusive with from_static - both can be true when the lists overlap.
  • job_crawl_name - our name of the crawl, contains year and month (e.g., 'regular-2024-12' for regular crawls, in Dec 2024)
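
The job_index_redirect column links a redirecting job to the job that crawled the redirect target. Following these links could look like the sketch below. It assumes that rows have already been read into dictionaries and that an empty job_index_redirect means "no redirect"; the README does not state how missing values are encoded, so treat that as an assumption. The function names and example rows are hypothetical.

```python
def build_redirect_map(rows):
    """Map each job_id to the job_id of its redirect target, if any."""
    redirects = {}
    for row in rows:
        target = row.get("job_index_redirect")
        if target:  # assumption: empty string or None means "no redirect"
            redirects[row["job_id"]] = target
    return redirects


def resolve_job(job_id, redirects):
    """Follow redirect links until reaching a job that does not redirect."""
    seen = set()
    while job_id in redirects and job_id not in seen:
        seen.add(job_id)  # guard against accidental cycles
        job_id = redirects[job_id]
    return job_id


# Invented example rows: job 10 redirects to 11, which redirects to 12.
rows = [
    {"job_id": "10", "job_index_redirect": "11"},
    {"job_id": "11", "job_index_redirect": "12"},
    {"job_id": "12", "job_index_redirect": ""},
]
redirects = build_redirect_map(rows)
print(resolve_job("10", redirects))  # prints 12
```

Whether redirect chains longer than one hop occur in the data is not stated; the loop simply handles them if they do.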

Policy data

  • policy_url_id - ID of the URL this policy has
  • policy_keyword_score - score (higher is better) according to the crawler's keywords list that the given document is a policy
  • policy_ml_probability - probability assigned by the BERT model that the given document is a policy
  • policy_consideration_basis - the basis on which we decided that this URL is a policy. The following three options are executed by the crawler in this order:
    1. 'keyword matching' - this policy was found using the crawler navigation (which is based on keywords)
    2. 'search' - this policy was found using a search engine
    3. 'path guessing' - this policy was found by using well-known URLs like example.com/policy
  • policy_url - full URL to the policy
  • policy_content_hash - used as identifier - if the document remained the same between crawls, it won't create a new entry
  • policy_content - contains the text of policies and terms extracted to Markdown using Mozilla's readability library
  • policy_lang - language of the content as detected by fastText

Terms data

Analogous to the policy data; just replace policy with terms.
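
As an illustration of how the document columns might be used together, the sketch below keeps rows whose classifier probability passes a threshold and drops repeated texts via the content hash. It assumes the terms columns follow the same naming with a terms_ prefix (for example terms_ml_probability) and that empty cells mean no document was found; both are assumptions. The function name select_documents, the 0.5 threshold, and the example rows are hypothetical.

```python
def select_documents(rows, kind="policy", min_probability=0.5):
    """Yield one row per distinct document that passes the threshold.

    kind is "policy" or "terms" and selects the column prefix.
    min_probability is an arbitrary illustrative threshold, not a value
    recommended by the dataset authors.
    """
    seen_hashes = set()
    for row in rows:
        probability = row.get(f"{kind}_ml_probability")
        content_hash = row.get(f"{kind}_content_hash")
        if not probability or not content_hash:
            continue  # assumption: empty cells mean no document was found
        if float(probability) < min_probability:
            continue  # classifier considers this unlikely to be a match
        if content_hash in seen_hashes:
            continue  # same text was already yielded
        seen_hashes.add(content_hash)
        yield row


# Invented example rows.
rows = [
    {"policy_ml_probability": "0.97", "policy_content_hash": "abc"},
    {"policy_ml_probability": "0.98", "policy_content_hash": "abc"},
    {"policy_ml_probability": "0.12", "policy_content_hash": "def"},
    {"policy_ml_probability": "", "policy_content_hash": ""},
]
kept = list(select_documents(rows, kind="policy", min_probability=0.5))
print([row["policy_content_hash"] for row in kept])  # prints ['abc']
```

Because it is a generator, the function can be fed directly from a streaming CSV reader without loading a whole file.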

Updates

Check this Google Docs for an updated version of this README.md.
