90 datasets found
  1. Most popular websites in the Netherlands 2015 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jun 2, 2017
    Cite
    (2017). Most popular websites in the Netherlands 2015 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3edeb59b-b49b-59cb-9757-9127aed7e8af
    Explore at:
    Dataset updated
    Jun 2, 2017
    Area covered
    Netherlands
    Description

    This dataset contains a list of 3,654 Dutch websites that we considered the most popular websites in 2015. The list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The Newstracker was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'. For the Newstracker project we aimed to understand the web behaviour of a group of respondents, so we created custom-built software to monitor their web browsing behaviour on their laptops and desktops (the code is available in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist of the websites that were the most popular in 2015, compiled manually using data from DDMM, Alexa, and our own research. The dataset consists of 5 columns:

    • URL
    • Type of website: each website was manually labeled with one category from a list of website types we created
    • Nieuws-regio: websites in the 'News' category were subdivided by regional focus: International, National, or Local
    • Nieuws-onderwerp: each website in the 'News' category was further subdivided by type of news website, manually coded against our own list of news categories
    • Bron: the source we used to find each website

    The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.

  2. Online Sales Dataset - Popular Marketplace Data

    • kaggle.com
    Updated May 25, 2024
    Cite
    ShreyanshVerma27 (2024). Online Sales Dataset - Popular Marketplace Data [Dataset]. https://www.kaggle.com/datasets/shreyanshverma27/online-sales-dataset-popular-marketplace-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ShreyanshVerma27
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.

    Columns:

    • Order ID: Unique identifier for each sales order.
    • Date: Date of the sales transaction.
    • Category: Broad category of the product sold (e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).
    • Product Name: Specific name or model of the product sold.
    • Quantity: Number of units of the product sold in the transaction.
    • Unit Price: Price of one unit of the product.
    • Total Price: Total revenue generated from the sales transaction (Quantity * Unit Price).
    • Region: Geographic region where the transaction occurred (e.g., North America, Europe, Asia).
    • Payment Method: Method used for payment (e.g., Credit Card, PayPal, Debit Card).
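
    As a quick illustration of this schema, here is a minimal Pandas sketch that loads the file and sanity-checks the Total Price column; the CSV file name is an assumption, and the column names follow the list above:

      import pandas as pd

      # Hypothetical file name for the Kaggle CSV export.
      df = pd.read_csv("online_sales_dataset.csv")

      # Total Price should equal Quantity * Unit Price on every row.
      assert (df["Quantity"] * df["Unit Price"] - df["Total Price"]).abs().max() < 1e-6

      # Revenue by category and region, in the spirit of the insights below.
      print(df.groupby(["Category", "Region"])["Total Price"].sum())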

    Insights:

    1. Analyze sales trends over time to identify seasonal patterns or growth opportunities.
    2. Explore the popularity of different product categories across regions.
    3. Investigate the impact of payment methods on sales volume or revenue.
    4. Identify top-selling products within each category to optimize inventory and marketing strategies.
    5. Evaluate the performance of specific products or categories in different regions to tailor marketing campaigns accordingly.
  3. Custom dataset from any website on the Internet

    • datarade.ai
    Updated Sep 21, 2022
    Cite
    ScrapeLabs (2022). Custom dataset from any website on the Internet [Dataset]. https://datarade.ai/data-products/custom-dataset-from-any-website-on-the-internet-scrapelabs
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Sep 21, 2022
    Dataset authored and provided by
    ScrapeLabs
    Area covered
    Tunisia, Kazakhstan, Bulgaria, India, Jordan, Guinea-Bissau, Lebanon, Argentina, Turks and Caicos Islands, Aruba
    Description

    We'll extract any data from any website on the Internet. You don't have to worry about buying and maintaining complex and expensive software, or hiring developers.

    Some common use cases our customers use the data for:

    • Data Analysis
    • Market Research
    • Price Monitoring
    • Sales Leads
    • Competitor Analysis
    • Recruitment

    We can get data from websites with pagination or scroll, with captchas, and even from behind logins. Text, images, videos, documents.

    Receive data in any format you need: Excel, CSV, JSON, or any other.

  4. Corporate Website — Analytics — Popular pages

    • data.qld.gov.au
    • researchdata.edu.au
    html
    Updated Sep 22, 2025
    Cite
    Brisbane City Council (2025). Corporate Website — Analytics — Popular pages [Dataset]. https://www.data.qld.gov.au/dataset/corporate-website-analytics-popular-pages
    Explore at:
    Available download formats: html
    Dataset updated
    Sep 22, 2025
    Dataset authored and provided by
    Brisbane City Council
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is available on Brisbane City Council’s open data website – data.brisbane.qld.gov.au. The site provides additional features for viewing and interacting with the data and for downloading the data in various formats.

    Monthly analytics reports for the Brisbane City Council website

    Information regarding sessions on the Brisbane City Council website during the month, including page views and unique page views.

  5. Network Traffic Dataset

    • kaggle.com
    Updated Oct 31, 2023
    Cite
    Ravikumar Gattu (2023). Network Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/ravikumargattu/network-traffic-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ravikumar Gattu
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The data presented here was obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio, by carrying out packet captures for 1 hour during the evening of Oct 9th, 2023 using Wireshark. The dataset consists of 394,137 instances stored in a CSV (Comma-Separated Values) file. This large dataset can be utilised for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.

    The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.

    Content :

    This network traffic dataset consists of 7 features. Each instance contains information such as the source and destination IP addresses. The majority of the properties are numeric in nature, but there are also nominal and date types due to the Timestamp.

    The network traffic flow statistics (No., Time, Source, Destination, Protocol, Length, Info) were obtained using Wireshark (https://www.wireshark.org/).

    Dataset Columns:

    • No: number of the instance
    • Timestamp: timestamp of the network traffic instance
    • Source IP: IP address of the source
    • Destination IP: IP address of the destination
    • Protocol: protocol used by the instance
    • Length: length of the instance
    • Info: information about the traffic instance
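
    A minimal sketch of loading such an export with Pandas (the file name is hypothetical; the columns follow the list above):

      import pandas as pd

      df = pd.read_csv("network_traffic.csv")  # hypothetical file name

      # Parse capture timestamps, then inspect protocol frequencies and packet sizes.
      df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
      print(df["Protocol"].value_counts().head(10))
      print(df["Length"].describe())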

    Acknowledgements :

    I would like to thank the University of Cincinnati for providing the infrastructure for the generation of this network traffic dataset.

    Ravikumar Gattu , Susmitha Choppadandi

    Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports machine learning models that can identify specific applications (like TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).

    Dataset License: CC0: Public Domain

    Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.

    ML techniques that benefit from this dataset:

    This dataset is highly useful because it consists of 394,137 instances of network traffic data obtained by using the 25 applications on public, private, and enterprise networks. The dataset also contains very important features that can be used for most applications of machine learning in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are:

    1. Network Performance Monitoring: this large network traffic dataset can be utilised for analysing the network traffic and identifying patterns in the network. This helps in designing network security algorithms that minimise network problems.

    2. Anomaly Detection: a large network traffic dataset can be utilised for training machine learning models to find irregularities in the traffic, which could help identify cyber attacks.

    3. Network Intrusion Detection: this large dataset could be utilised for training machine learning algorithms and designing models for the detection of traffic issues, malicious traffic, network attacks, and DoS attacks.

  6. Dataset used for HTTPS traffic classification using packet burst statistics

    • data.niaid.nih.gov
    Updated Apr 11, 2022
    Cite
    Hynek Karel (2022). Dataset used for HTTPS traffic classification using packet burst statistics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4911550
    Explore at:
    Dataset updated
    Apr 11, 2022
    Dataset provided by
    Cejka Tomas
    Hynek Karel
    Tropkova Zdena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We are publishing a dataset we created for HTTPS traffic classification.

    Since the data were captured mainly in a real backbone network, we omitted IP addresses and ports. The datasets consist of features calculated from bidirectional flows exported with the flow exporter ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.

    During our research, we divided HTTPS traffic into the following categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, and W -- Website and other traffic.

    We have chosen the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:

    • Live Video Stream: Twitch, Czech TV, YouTube Live
    • Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
    • Music Player: AppleMusic, Spotify, SoundCloud
    • File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
    • Website and Other Traffic: websites from the Alexa Top 1M list
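
    To illustrate what packet burst statistics are, here is a small sketch that groups a flow's packets into bursts and computes simple per-burst features; the burst definition (consecutive packets in the same direction within a gap threshold) and the field layout are assumptions for illustration, not the exact ipfixprobe definition:

      # A burst here = consecutive packets in the same direction whose
      # inter-arrival gap stays below a threshold (assumed definition).
      def burst_stats(packets, max_gap=0.1):
          """packets: list of (timestamp_s, length_bytes, direction) tuples."""
          bursts, current = [], []
          for pkt in packets:
              if current and (pkt[2] != current[-1][2] or
                              pkt[0] - current[-1][0] > max_gap):
                  bursts.append(current)
                  current = []
              current.append(pkt)
          if current:
              bursts.append(current)
          # Per-burst features: packet count and total bytes.
          return [(len(b), sum(p[1] for p in b)) for b in bursts]

      flow = [(0.00, 1500, 0), (0.01, 1500, 0), (0.02, 60, 1), (0.50, 1200, 0)]
      print(burst_stats(flow))  # -> [(2, 3000), (1, 60), (1, 1200)]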

  7. Website Screenshots Object Detection Dataset - raw

    • public.roboflow.com
    zip
    Updated Aug 19, 2022
    Cite
    Brad Dwyer (2022). Website Screenshots Object Detection Dataset - raw [Dataset]. https://public.roboflow.com/object-detection/website-screenshots/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 19, 2022
    Dataset authored and provided by
    Brad Dwyer
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Bounding Boxes of elements
    Description

    About This Dataset

    The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1,000 of the world's top websites. They have been automatically annotated to label the following classes:

    • button - navigation links, tabs, etc.
    • heading - text that was enclosed in <h1> to <h6> tags.
    • link - inline, textual <a> tags.
    • label - text labeling form fields.
    • text - all other text.
    • image - <img>, <svg>, or <video> tags, and icons.
    • iframe - ads and 3rd party content.

    Example

    This is an example image and annotation from the dataset (a Wikipedia screenshot): https://i.imgur.com/mOG3u3Z.png

    Usage

    Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.

    Collecting Custom Data

    Roboflow is happy to provide a custom screenshots dataset to meet your particular needs. We can crawl public or internal web applications. Just reach out and we'll be happy to provide a quote!

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless. Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.


  8. Grepsr | Trip Advisor Property Address and Reviews | Global Coverage with...

    • datarade.ai
    Updated Jan 1, 2023
    Cite
    Grepsr (2023). Grepsr| Trip Advisor Property Address and Reviews | Global Coverage with Custom and On-demand Datasets [Dataset]. https://datarade.ai/data-products/grepsr-trip-advisor-property-address-and-reviews-global-co-grepsr
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Jan 1, 2023
    Dataset authored and provided by
    Grepsr
    Area covered
    Greece, Holy See, Andorra, Benin, Myanmar, Croatia, Turkey, Italy, Sao Tome and Principe, Cuba
    Description

    A. Market Research and Analysis: Utilize the Tripadvisor dataset to conduct in-depth market research and analysis in the travel and hospitality industry. Identify emerging trends, popular destinations, and customer preferences. Gain a competitive edge by understanding your target audience's needs and expectations.

    B. Competitor Analysis: Compare and contrast your hotel or travel services with competitors on Tripadvisor. Analyze their ratings, customer reviews, and performance metrics to identify strengths and weaknesses. Use these insights to enhance your offerings and stand out in the market.

    C. Reputation Management: Monitor and manage your hotel's online reputation effectively. Track and analyze customer reviews and ratings on Tripadvisor to identify improvement areas and promptly address negative feedback. Positive reviews can be leveraged for marketing and branding purposes.

    D. Pricing and Revenue Optimization: Leverage the Tripadvisor dataset to analyze pricing strategies and revenue trends in the hospitality sector. Understand seasonal demand fluctuations, pricing patterns, and revenue optimization opportunities to maximize your hotel's profitability.

    E. Customer Sentiment Analysis: Conduct sentiment analysis on Tripadvisor reviews to gauge customer satisfaction and sentiment towards your hotel or travel service. Use this information to improve guest experiences, address pain points, and enhance overall customer satisfaction.

    F. Content Marketing and SEO: Create compelling content for your hotel or travel website based on the popular keywords, topics, and interests identified in the Tripadvisor dataset. Optimize your content to improve search engine rankings and attract more potential guests.

    G. Personalized Marketing Campaigns: Use the data to segment your target audience based on preferences, travel habits, and demographics. Develop personalized marketing campaigns that resonate with different customer segments, resulting in higher engagement and conversions.

    H. Investment and Expansion Decisions: Access historical and real-time data on hotel performance and market dynamics from Tripadvisor. Utilize this information to make data-driven investment decisions, identify potential areas for expansion, and assess the feasibility of new ventures.

    I. Predictive Analytics: Utilize the dataset to build predictive models that forecast future trends in the travel industry. Anticipate demand fluctuations, understand customer behavior, and make proactive decisions to stay ahead of the competition.

    J. Business Intelligence Dashboards: Create interactive and insightful dashboards that visualize key performance metrics from the Tripadvisor dataset. These dashboards can help executives and stakeholders get a quick overview of the hotel's performance and make data-driven decisions.

    Incorporating the Tripadvisor dataset into your business processes will enhance your understanding of the travel market, facilitate data-driven decision-making, and provide valuable insights to drive success in the competitive hospitality industry.

  9. LegitPhish Dataset

    • data.mendeley.com
    Updated Apr 7, 2025
    Cite
    Rachana Potpelwar (2025). LegitPhish Dataset [Dataset]. http://doi.org/10.17632/hx4m73v2sf.1
    Explore at:
    Dataset updated
    Apr 7, 2025
    Authors
    Rachana Potpelwar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 101,219 URLs and 18 features (including the label). Class distribution:

    • Phishing (0): 63,678 URLs
    • Legitimate (1): 37,540 URLs

    The phishing URLs have been sourced from the URLhaus database and other well-known repositories of malicious websites actively used in phishing attacks. Each entry in this subset has been manually verified and is labeled as a phishing URL, making this dataset highly reliable for identifying harmful web content.

    The legitimate URLs have been collected from reputable sources such as Wikipedia and Stack Overflow. These websites are known for hosting user-generated content and community discussions, ensuring that the URLs represent safe, legitimate web addresses. The URLs were randomly scraped to ensure diversity in the types of legitimate sites included. Dataset Features:

    • URL: the full web address of each entry, providing the primary feature for analysis.
    • Label: a binary label indicating whether the URL is legitimate (1) or phishing (0).

    Applications:

    This dataset is suitable for training and evaluating machine learning models aimed at distinguishing between phishing and legitimate websites. It can be used in a variety of cybersecurity research projects, including URL-based phishing detection, web content analysis, and the development of real-time protection systems.

    Usage:

    Researchers can leverage this dataset to develop and test algorithms for identifying phishing websites with high accuracy, using features such as URL structure and the class label attribute. The inclusion of both phishing and legitimate URLs provides a comprehensive basis for creating robust models capable of detecting phishing attempts in diverse online environments.

    The dataset features are:

    • URL - the full URL string.
    • url_length - total number of characters in the URL.
    • has_ip_address - binary flag (1/0): whether the URL contains an IP address.
    • dot_count - number of '.' characters in the URL.
    • https_flag - binary flag (1/0): whether the URL uses HTTPS.
    • url_entropy - Shannon entropy of the URL string; higher values indicate more randomness.
    • token_count - number of tokens/words in the URL.
    • subdomain_count - number of subdomains in the URL.
    • query_param_count - number of query parameters (after '?').
    • tld_length - length of the top-level domain (e.g., "com" = 3).
    • path_length - length of the path part after the domain.
    • has_hyphen_in_domain - binary flag (1/0): whether the domain contains a hyphen (-).
    • number_of_digits - total number of numeric characters in the URL.
    • tld_popularity - binary flag (1/0): whether the TLD is popular.
    • suspicious_file_extension - binary flag (1/0): whether the URL ends with a suspicious extension (e.g., .exe, .zip).
    • domain_name_length - length of the domain name.
    • percentage_numeric_chars - percentage of numeric characters in the URL.
    • ClassLabel - target label: 1 = Legitimate, 0 = Phishing.
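
    As an illustration of how a feature like url_entropy can be computed, here is a minimal sketch; the exact formula used by the dataset author is an assumption:

      import math
      from collections import Counter

      def shannon_entropy(s: str) -> float:
          """Shannon entropy of a string, in bits per character."""
          counts = Counter(s)
          n = len(s)
          return -sum((c / n) * math.log2(c / n) for c in counts.values())

      print(round(shannon_entropy("https://example.com/login"), 3))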

  10. Chrome User Experience Report (India Only)

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Chrome User Experience Report (India Only) [Dataset]. https://www.kaggle.com/bigquery/chrome-ux-report-country-in
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    Area covered
    India
    Description

    Context

    Google Chrome is a popular web browser developed by Google.

    Content

    The Chrome User Experience Report is a public dataset of key user experience metrics for popular origins on the web, as experienced by Chrome users under real-world conditions.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/chrome-ux-report:all

    For more info, see the documentation at https://developers.google.com/web/tools/chrome-user-experience-report/
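
    For reference, a minimal sketch of querying the dataset from Python; the table name and histogram schema follow the standard CrUX BigQuery layout, which is an assumption here, and Google Cloud credentials must be configured:

      from google.cloud import bigquery

      client = bigquery.Client()  # requires configured Google Cloud credentials
      query = """
          SELECT origin, SUM(fcp.density) AS fast_fcp_share
          FROM `chrome-ux-report.country_in.201812`,
               UNNEST(first_contentful_paint.histogram.bin) AS fcp
          WHERE fcp.start < 1000  -- first contentful paint under one second
          GROUP BY origin
          ORDER BY fast_fcp_share DESC
          LIMIT 10
      """
      for row in client.query(query).result():
          print(row.origin, round(row.fast_fcp_share, 3))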

    License: CC BY 4.0

    Photo by Edho Pratama on Unsplash

  11. Multilingual Scraper of Privacy Policies and Terms of Service

    • zenodo.org
    bin, zip
    Updated Apr 24, 2025
    Cite
    David Bernhard; Luka Nenadic; Stefan Bechtold; Karel Kubicek (2025). Multilingual Scraper of Privacy Policies and Terms of Service [Dataset]. http://doi.org/10.5281/zenodo.14562039
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Bernhard; Luka Nenadic; Stefan Bechtold; Karel Kubicek
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Multilingual Scraper of Privacy Policies and Terms of Service: Scraped Documents of 2024

    This dataset supplements the publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW'25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites; see the concrete numbers below.

    The following table lists the amount of websites visited per month:

    Month | Number of websites
    2024-01 | 551'148
    2024-02 | 792'921
    2024-03 | 844'537
    2024-04 | 802'169
    2024-05 | 805'878
    2024-06 | 809'518
    2024-07 | 811'418
    2024-08 | 813'534
    2024-09 | 814'321
    2024-10 | 817'586
    2024-11 | 828'662
    2024-12 | 827'101

    The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), as a website may redirect, resulting in two websites being scraped, or it may have to be retried.

    To simplify access, we release the data as large CSVs: one file for policies and another for terms per month. All of these files contain all the metadata usable for the analysis. If your favourite CSV parser reports the same numbers as above, our dataset is correctly parsed. We use ',' as the separator, the first row is the header, and strings are quoted.

    Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication), and these documents might contain personal data, such as addresses of website authors who maintain their sites only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio. Presidio substitutes personal data with tokens. If your personal data has not been effectively anonymized from the database and you wish for it to be deleted, please contact us.

    Preliminaries

    The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. This also means that you likely cannot process all the data at once in memory, so we split the data by month and into separate files for policies and terms.

    Files and structure

    The files have the following names:

    • 2024_policy.csv for policies
    • 2024_terms.csv for terms
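
    Given the ~125 GB size noted above, a chunked read keeps memory bounded; a minimal Pandas sketch (the chunk size is arbitrary):

      import pandas as pd

      # Stream one month's policy file in chunks instead of loading it at once.
      n_rows = 0
      for chunk in pd.read_csv("2024_policy.csv", sep=",", quotechar='"',
                               chunksize=100_000):
          n_rows += len(chunk)

      print(n_rows)  # compare against the per-month counts reported above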

    Shared metadata

    Both files contain the following metadata columns:

    • website_month_id - identification of crawled website
    • job_id - one website can have multiple jobs in case of redirects (but most commonly has only one)
    • website_index_status - network state of loading the index page. This is resolved via the Chrome DevTools Protocol.
      • DNS_ERROR - domain cannot be resolved
      • OK - all fine
      • REDIRECT - domain redirect to somewhere else
      • TIMEOUT - the request timed out
      • BAD_CONTENT_TYPE - 415 Unsupported Media Type
      • HTTP_ERROR - 404 error
      • TCP_ERROR - error in the network connection
      • UNKNOWN_ERROR - unknown error
    • website_lang - language of index page detected based on langdetect library
    • website_url - the URL of the website sampled from the CrUX list (may contain subdomains, etc). Use this as a unique identifier for connecting data between months.
    • job_domain_status - indicates the status of loading the index page. Can be:
      • OK - all works well (at the moment, should be all entries)
      • BLACKLISTED - URL is on our list of blocked URLs
      • UNSAFE - website is not safe according to Google's Safe Browsing API
      • LOCATION_BLOCKED - country is in the list of blocked countries
    • job_started_at - when the visit of the website was started
    • job_ended_at - when the visit of the website was ended
    • job_crux_popularity - JSON with all popularity ranks of the website this month
    • job_index_redirect - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The index_redirect is then the job.id corresponding to the redirect target.
    • job_num_starts - amount of crawlers that started this job (counts restarts in case of unsuccessful crawl, max is 3)
    • job_from_static - whether this job was included in the static selection (see Sec. 3.3 of the paper)
    • job_from_dynamic - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper) - this is not exclusive with from_static - both can be true when the lists overlap.
    • job_crawl_name - our name of the crawl, contains year and month (e.g., 'regular-2024-12' for regular crawls, in Dec 2024)

    Policy data

    • policy_url_id - ID of the URL this policy has
    • policy_keyword_score - score (higher is better) according to the crawler's keywords list that given document is a policy
    • policy_ml_probability - probability assigned by the BERT model that given document is a policy
    • policy_consideration_basis - on which basis we decided that this url is policy. The following three options are executed by the crawler in this order:
      1. 'keyword matching' - this policy was found using the crawler navigation (which is based on keywords)
      2. 'search' - this policy was found using search engine
      3. 'path guessing' - this policy was found by using well-known URLs like example.com/policy
    • policy_url - full URL to the policy
    • policy_content_hash - used as identifier - if the document remained the same between crawls, it won't create a new entry
    • policy_content - contains the text of policies and terms extracted to Markdown using Mozilla's readability library
    • policy_lang - language of the content, detected by fastText

    Terms data

    Analogous to the policy data; just substitute 'policy' with 'terms' in the column names.

    Updates

    Check this Google Docs for an updated version of this README.md.

  12. ckanext-dapr

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-dapr [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-dapr
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    The DAPR extension for CKAN integrates with the Digital Analytics Program (DAP) to retrieve and display usage statistics for datasets and resources within a CKAN instance. This extension allows administrators to monitor how frequently datasets are accessed, providing valuable insights into data usage patterns. By tracking download events, DAPR enriches CKAN's functionality, facilitating data-driven decision-making and resource management.

    Key Features:

    • DAP Download Event Retrieval: retrieves download events tracked by the Digital Analytics Program and stores them within CKAN for later analysis and display. This ensures that access data is captured and made available alongside other dataset metadata.
    • Frequently Accessed Dataset Listing: enables the creation of lists showcasing the most frequently accessed datasets. This allows administrators to identify popular datasets and prioritize resources accordingly.
    • Dataset and Resource Access Counting: provides a mechanism to display access counts for datasets and individual resources. These counts can be displayed within the CKAN interface, providing users with immediate feedback on dataset popularity.
    • DAP-enabled Website Event Tracking: tracks accesses to resources even when those accesses originate from external DAP-enabled websites, providing a comprehensive view of data usage regardless of the access point.
    • Scheduled Data Refresh: supports a command-line utility for scheduled access data imports, ensuring usage statistics remain up-to-date with minimal manual intervention. It has options to specify the start and end date or retrieve all records.

    Technical Integration: the DAPR extension integrates with CKAN by adding new database tables to store DAP tracking data. Administrators configure the extension through the CKAN configuration file (ckan.ini) or similar. The extension also provides a command-line interface (CLI) tool for importing DAP tracking events, which can be scheduled using cron or similar task schedulers.

    Benefits & Impact: by integrating DAP statistics, the DAPR extension allows CKAN instance owners to improve data visibility, assess the value of available datasets, and make data-informed decisions about resource allocation and data discoverability improvements. Knowing which datasets are frequently used and accessed can help data curators prioritize updates, augment popular datasets with further information, and invest resources into ensuring that the CKAN instance delivers the most relevant and impactful data to its audience.

  13. Websites Quality In Use Evaluation Dataset

    • zenodo.org
    zip
    Updated Feb 7, 2020
    Cite
    Michail D. Papamichail; Themistoklis Diamantopoulos; Kyriakos C. Chatzidimitriou; Andreas Symeonidis (2020). Websites Quality In Use Evaluation Dataset [Dataset]. http://doi.org/10.5281/zenodo.2556517
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 7, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michail D. Papamichail; Themistoklis Diamantopoulos; Kyriakos C. Chatzidimitriou; Andreas Symeonidis
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains a number of computed dynamic analysis metrics related to quality in use for the 5,000 most popular websites.

  14. Airlines Flights Data

    • kaggle.com
    Updated Jul 29, 2025
    Cite
    Data Science Lovers (2025). Airlines Flights Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/airlines-flights-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Science Lovers
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    📹Project Video available on YouTube - https://youtu.be/gu3Ot78j_Gc

    Airlines Flights Dataset for Different Cities

    The flight booking dataset for various airlines was scraped date-wise, in a structured format, from a well-known travel website. The dataset contains records of flight travel details between cities in India. Multiple features are present, such as source and destination city, arrival and departure time, duration, and price of the flight.

    This data is available as a CSV file. We are going to analyze this data set using the Pandas DataFrame.

    This analysis will be helpful for those working in the airline and travel domains.

    Using this dataset, we answered multiple questions with Python in our Project.

    Q.1. What are the airlines in the dataset, accompanied by their frequencies?

    Q.2. Show bar graphs representing the Departure Time & Arrival Time.

    Q.3. Show bar graphs representing the Source City & Destination City.

    Q.4. Does the price vary by airline?

    Q.5. Does the ticket price change based on the departure time and arrival time?

    Q.6. How does the price change with the source and destination?

    Q.7. How is the price affected when tickets are bought just 1 or 2 days before departure?

    Q.8. How does the ticket price vary between Economy and Business class?

    Q.9. What will be the average price of a Vistara flight from Delhi to Hyderabad in Business class?
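
    As a sketch of the kind of Pandas analysis behind these questions, Q.9 reduces to a filtered mean; the file and column names are assumptions based on the feature list below:

      import pandas as pd

      df = pd.read_csv("airlines_flights_data.csv")  # hypothetical file name

      mask = (
          (df["airline"] == "Vistara")
          & (df["source_city"] == "Delhi")
          & (df["destination_city"] == "Hyderabad")
          & (df["class"] == "Business")
      )
      print(df.loc[mask, "price"].mean())  # Q.9: average Business-class fare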

    These are the main Features/Columns available in the dataset :

    1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.

    2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.

    3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.

    4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.

    5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.

    6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.

    7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.

    8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.

    9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.

    10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.

    11) Price: Target variable stores information of the ticket price.

  15. Popularity Dataset for Online Stats Training

    • zenodo.org
    bin
    Updated Aug 25, 2020
    Cite
    Rens van de Schoot (2020). Popularity Dataset for Online Stats Training [Dataset]. http://doi.org/10.5281/zenodo.3962123
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 25, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rens van de Schoot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset used for the online stats training website (https://www.rensvandeschoot.com/tutorials/) and is based on the data used by van de Schoot, van der Velden, Boom, and Brugman (2010).

    The dataset is based on a study that investigates the association between popularity status and antisocial behavior in at-risk adolescents (n = 1,491), with gender and ethnic background as moderators of the association. The study distinguished subgroups within the popular status group in terms of overt and covert antisocial behavior. For more information on the sample, instruments, methodology, and research context, we refer the interested reader to van de Schoot, van der Velden, Boom, and Brugman (2010).

    Variable name Description

    Respnr = Respondents’ number

    Dutch = Respondents’ ethnic background (0 = Dutch origin, 1 = non-Dutch origin)

    gender = Respondents’ gender (0 = boys, 1 = girls)

    sd = Adolescents’ socially desirable answering patterns

    covert = Covert antisocial behavior

    overt = Overt antisocial behavior
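
    As a hypothetical sketch of the kind of moderation analysis this training dataset supports (the file name, file format, and model specification are assumptions, not part of the dataset documentation):

      import pandas as pd
      import statsmodels.formula.api as smf

      df = pd.read_spss("popular.sav")  # hypothetical file name

      # Does covert antisocial behavior differ by gender and ethnic
      # background? The interaction term models the moderation.
      model = smf.ols("covert ~ gender * Dutch + sd", data=df).fit()
      print(model.summary())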

  16. Global Starlink Web Cache Latency & Traceroute Measurement Dataset

    • zenodo.org
    Updated Feb 6, 2025
    Cite
    Qi Zhang; Zeqi Lai; Qian Wu; Jihao Li; Hewu Li (2025). Global Starlink Web Cache Latency & Traceroute Measurement Dataset [Dataset]. http://doi.org/10.5281/zenodo.14800115
    Explore at:
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Qi Zhang; Zeqi Lai; Qian Wu; Jihao Li; Hewu Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains global web cache latency measurements collected via RIPE Atlas probes equipped with Starlink terminals across five continents, spanning over 24 hours and resulting in ~2 million measurements. The measurements aim to evaluate the user-perceived latency of accessing popular websites through low-earth-orbit (LEO) satellite networks.

    This dataset is a product of Spache, a research project on web caching from space. Please refer to its WWW'25 paper for more details and analysis results.

    Dataset File Content

    The dataset includes the following files:

    • Metadata
      • Target website list: a list of the top 50 most popular websites according to the Alexa ranking. Note: microsoftonline.com (originally ranked 41st) is not included due to its unresolvable domain name.
      • RIPE Atlas Measurement IDs: for each website, the corresponding RIPE Atlas measurement IDs for both Ping and Traceroute measurements are provided.

    • Measurement results - Raw Data
      • Ping and Traceroute results: raw measurement results for each target website, including detailed information on each measurement.

    • Measurement results - Preprocessed Latency
      • Ping RTT latency: preprocessed data containing the minimum RTT (round-trip time, in milliseconds) for each Ping measurement to all target websites.
      • Probe information: corresponding probe IDs, along with their respective countries and continents at the time of measurement.

    This dataset is intended to support research on web caching, particularly in the context of satellite Internet. Please cite both this dataset and the associated paper if you find this data useful.
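
    A minimal sketch of extracting per-probe minimum RTTs from the raw ping results, assuming the standard RIPE Atlas JSON result format (probe ID in "prb_id", individual replies under "result"); the file name is hypothetical:

      import json

      with open("ping_results.json") as f:  # hypothetical file name
          results = json.load(f)

      min_rtt = {}
      for r in results:
          rtts = [p["rtt"] for p in r.get("result", []) if "rtt" in p]
          if rtts and min(rtts) < min_rtt.get(r["prb_id"], float("inf")):
              min_rtt[r["prb_id"]] = min(rtts)

      for probe, rtt in sorted(min_rtt.items()):
          print(probe, round(rtt, 1), "ms")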

  17. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jun 19, 2024
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones.

    Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  18. TED dataset

    • data.niaid.nih.gov
    Updated Oct 6, 2020
    Cite
    Popescu-Belis, Andrei (2020). TED dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4061423
    Explore at:
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Pappas, Nikolaos
    Popescu-Belis, Andrei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset for recommendations collected from ted.com, containing metadata fields for TED talks and user profiles with rating and commenting transactions.

    The TED dataset contains all the audio-video recordings of the TED talks downloaded from the official TED website, http://www.ted.com, on April 27th 2012 (first version) and on September 10th 2012 (second version). No processing has been done on any of the metadata fields. The metadata was obtained by crawling the HTML source of the list of talks and users, as well as talk and user webpages using scripts written by Nikolaos Pappas at the Idiap Research Institute, Martigny, Switzerland. The dataset is shared under the Creative Commons license (the same as the content of the TED talks) which is stored in the COPYRIGHT file. The dataset is shared for research purposes which are explained in detail in the following papers. The dataset can be used to benchmark systems that perform two tasks, namely personalized recommendations and generic recommendations. Please check the CBMI 2013 paper for a detailed description of each task.

    Nikolaos Pappas, Andrei Popescu-Belis, "Combining Content with User Preferences for TED Lecture Recommendation", 11th International Workshop on Content Based Multimedia Indexing, Veszprém, Hungary, IEEE, 2013 PDF document, Bibtex citation

    Nikolaos Pappas, Andrei Popescu-Belis, Sentiment Analysis of User Comments for One-Class Collaborative Filtering over TED Talks, 36th ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, ACM, 2013 PDF document, Bibtex citation

    If you use the TED dataset for your research please cite one of the above papers (specifically the 1st paper for the April 2012 version and the 2nd paper for the September 2012 version of the dataset).

    TED website

    The TED website is a popular online repository of audiovisual recordings of public lectures given by prominent speakers, under a Creative Commons non-commercial license (see www.ted.com). The site provides extended metadata and user-contributed material. The speakers are scientists, writers, journalists, artists, and businesspeople from all over the world who are generally given a maximum of 18 minutes to present their ideas. The talks are given in English and are usually transcribed and then translated into several other languages by volunteer users. The quality of the talks has made TED one of the most popular online lecture repositories, as each talk was viewed on average almost 500,000 times.

    Metadata

    The dataset contains two main entry types: talks and users. The talks have the following data fields: identifier, title, description, speaker name, TED event at which they were given, transcript, publication date, filming date, number of views. Each talk has a variable number of user comments, organized in threads. In addition, three fields were assigned by TED editorial staff: related tags, related themes, and related talks. Each talk generally has three related talks, and 95% of them have a high-quality transcript available. The dataset includes 1,149 talks from 960 speakers and 69,023 registered users that have made about 100,000 favorites and 200,000 comments.

  19. Reorganized2 Dataset

    • universe.roboflow.com
    zip
    Updated Apr 27, 2023
    Cite
    bruce baur (2023). Reorganized2 Dataset [Dataset]. https://universe.roboflow.com/bruce-baur/reorganized2/dataset/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 27, 2023
    Dataset authored and provided by
    bruce baur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Webpages Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Web Accessibility Analysis: This model can be used to analyze the accessibility of web pages by identifying different elements and ensuring they follow good practices in design and user accessibility standards, such as having appropriate contrast between text and image, or usage of icons and buttons for UI/UX.

    2. Web Page Redesign: By identifying the classes of elements on a webpage, "Reorganized2" could be used by designers and developers to analyze a current website layout and assist in redesigning a more intuitive and user-friendly interface.

    3. UX Research and Testing: The model can be utilized in user experience (UX) research. It can help in identifying which elements (buttons, icons, dropdowns) on a webpage are getting more attention thus allowing UX designers to create more effective webpages.

    4. Web Scraping: In the field of data mining, the model can serve as a smart web scraper, identifying different elements on a page, thus making web scraping more efficient and targeted rather than pulling irrelevant information.

    5. E-commerce Optimization: "Reorganized2" can be used to analyze various e-commerce websites, spotting common design features amongst the most successful ones, especially regarding the usage and placement of 'cart', 'field', and 'dropdown' elements. These insights can be used to optimize other online retail sites.

  20. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Fabiano Dalpiaz; Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Data
    Authors
    Fabiano Dalpiaz; Fabiano Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 22 data sets of 50+ requirements each, expressed as user stories.

    The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].

    The data sets were originally used to conduct experiments on ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection was originally published in Mendeley Data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

    Overview of the datasets [data and links added in December 2024]

    The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

    Public administration and transparency

    g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted in GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of extracted requirements from a document, by the Loudoun County Virginia, that describes the to-be user stories and use cases about a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

    g04-recycling.txt (2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and is at the basis of a students' project on web site design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.

    g11-nsf.txt (2018) is a collection of user stories for the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories has been collected in 2016 by GitHub user @danfowler and are stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

    g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

    g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should part of a website that manages data management plans. Each user story is stored as an issue on the GitHub's page.

    g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its