32 datasets found
  1. Daily website visitors (time series regression)

    • kaggle.com
    Updated Aug 20, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bob Nau (2020). Daily website visitors (time series regression) [Dataset]. https://www.kaggle.com/bobnau/daily-website-visitors/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 20, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Bob Nau
    Description

    Context

    This file contains 5 years of daily time series data for several measures of traffic on a statistical forecasting teaching notes website whose alias is statforecasting.com. The variables have complex seasonality that is keyed to the day of the week and to the academic calendar. The patterns you you see here are similar in principle to what you would see in other daily data with day-of-week and time-of-year effects. Some good exercises are to develop a 1-day-ahead forecasting model, a 7-day ahead forecasting model, and an entire-next-week forecasting model (i.e., next 7 days) for unique visitors.

    Content

    The variables are daily counts of page loads, unique visitors, first-time visitors, and returning visitors to an academic teaching notes website. There are 2167 rows of data spanning the date range from September 14, 2014, to August 19, 2020. A visit is defined as a stream of hits on one or more pages on the site on a given day by the same user, as identified by IP address. Multiple individuals with a shared IP address (e.g., in a computer lab) are considered as a single user, so real users may be undercounted to some extent. A visit is classified as "unique" if a hit from the same IP address has not come within the last 6 hours. Returning visitors are identified by cookies if those are accepted. All others are classified as first-time visitors, so the count of unique visitors is the sum of the counts of returning and first-time visitors by definition. The data was collected through a traffic monitoring service known as StatCounter.

    Inspiration

    This file and a number of other sample datasets can also be found on the website of RegressIt, a free Excel add-in for linear and logistic regression which I originally developed for use in the course whose website generated the traffic data given here. If you use Excel to some extent as well as Python or R, you might want to try it out on this dataset.

  2. d

    Open Data Website Traffic

    • catalog.data.gov
    • data.lacity.org
    • +1more
    Updated Jun 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.lacity.org (2025). Open Data Website Traffic [Dataset]. https://catalog.data.gov/dataset/open-data-website-traffic
    Explore at:
    Dataset updated
    Jun 21, 2025
    Dataset provided by
    data.lacity.org
    Description

    Daily utilization metrics for data.lacity.org and geohub.lacity.org. Updated monthly

  3. Website Statistics

    • data.wu.ac.at
    • data.europa.eu
    csv, pdf
    Updated Jun 11, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lincolnshire County Council (2018). Website Statistics [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/M2ZkZDBjOTUtMzNhYi00YWRjLWI1OWMtZmUzMzA5NjM0ZTdk
    Explore at:
    csv, pdfAvailable download formats
    Dataset updated
    Jun 11, 2018
    Dataset provided by
    Lincolnshire County Councilhttp://www.lincolnshire.gov.uk/
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    This Website Statistics dataset has four resources showing usage of the Lincolnshire Open Data website. Web analytics terms used in each resource are defined in their accompanying Metadata file.

    • Website Usage Statistics: This document shows a statistical summary of usage of the Lincolnshire Open Data site for the latest calendar year.

    • Website Statistics Summary: This dataset shows a website statistics summary for the Lincolnshire Open Data site for the latest calendar year.

    • Webpage Statistics: This dataset shows statistics for individual Webpages on the Lincolnshire Open Data site by calendar year.

    • Dataset Statistics: This dataset shows cumulative totals for Datasets on the Lincolnshire Open Data site that have also been published on the national Open Data site Data.Gov.UK - see the Source link.

      Note: Website and Webpage statistics (the first three resources above) show only UK users, and exclude API calls (automated requests for datasets). The Dataset Statistics are confined to users with javascript enabled, which excludes web crawlers and API calls.

    These Website Statistics resources are updated annually in January by the Lincolnshire County Council Business Intelligence team. For any enquiries about the information contact opendata@lincolnshire.gov.uk.

  4. g

    Statistics, compilation of visits to TCN websites

    • gimi9.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statistics, compilation of visits to TCN websites [Dataset]. https://gimi9.com/dataset/eu_01c2fafa-25cb-4aae-9729-b86ee4851a6b/
    Explore at:
    Description

    Statistics on the visits to the websites of the institutions located on the single platform of the websites of national and local authorities. Statistics do not reflect all website visitors, but only those who have consented to statistical cookies.

  5. Website Metrics

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Jun 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    FEMA/Office of External Affairs/Communication Division (2025). Website Metrics [Dataset]. https://catalog.data.gov/dataset/website-metrics
    Explore at:
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Federal Emergency Management Agencyhttp://www.fema.gov/
    Description

    Per the Federal Digital Government Strategy, the Department of Homeland Security Metrics Plan, and the Open FEMA Initiative, FEMA is providing the following web performance metrics with regards to FEMA.gov.rnrnInformation in this dataset includes total visits, avg visit duration, pageviews, unique visitors, avg pages/visit, avg time/page, bounce ratevisits by source, visits by Social Media Platform, and metrics on new vs returning visitors.rnrnExternal Affairs strives to make all communications accessible. If you have any challenges accessing this information, please contact FEMAWebTeam@fema.dhs.gov.

  6. Google Analytics Sample

    • kaggle.com
    zip
    Updated Sep 19, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google BigQuery (2019). Google Analytics Sample [Dataset]. https://www.kaggle.com/bigquery/google-analytics-sample
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Sep 19, 2019
    Dataset provided by
    Googlehttp://google.com/
    BigQueryhttps://cloud.google.com/bigquery
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

    Content

    The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

    Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.

    Fork this kernel to get started.

    Acknowledgements

    Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

    Banner Photo by Edho Pratama from Unsplash.

    Inspiration

    What is the total number of transactions generated per device browser in July 2017?

    The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

    What was the average number of product pageviews for users who made a purchase in July 2017?

    What was the average number of product pageviews for users who did not make a purchase in July 2017?

    What was the average total transactions per user that made a purchase in July 2017?

    What is the average amount of money spent per session in July 2017?

    What is the sequence of pages viewed?

  7. Data from: Google Analytics & Twitter dataset from a movies, TV series and...

    • figshare.com
    • portalcientificovalencia.univeuropea.com
    txt
    Updated Feb 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Víctor Yeste (2024). Google Analytics & Twitter dataset from a movies, TV series and videogames website [Dataset]. http://doi.org/10.6084/m9.figshare.16553061.v4
    Explore at:
    txtAvailable download formats
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Víctor Yeste
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Víctor Yeste. Universitat Politècnica de Valencia.The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of the content published in online media and the possible prediction of the selected success variables.In this case, due to the need to integrate data from two separate areas, such as web publishing and the analysis of their shares and related topics on Twitter, has opted for programming as you access both the Google Analytics v4 reporting API and Twitter Standard API, always respecting the limits of these.The website analyzed is hellofriki.com. It is an online media whose primary intention is to solve the need for information on some topics that provide daily a vast number of news in the form of news, as well as the possibility of analysis, reports, interviews, and many other information formats. All these contents are under the scope of the sections of cinema, series, video games, literature, and comics.This dataset has contributed to the elaboration of the PhD Thesis:Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009Data have been obtained from each last-minute news article published online according to the indicators described in the doctoral thesis. All related data are stored in a database, divided into the following tables:tesis_followers: User ID list of media account followers.tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.status_id: Tweet IDcreated_at: date of publicationtext: content of the tweetpath: URL extracted after processing the shortened URL in textpost_shared: Article ID in WordPress that is being sharedretweet_count: number of retweetsfavorite_count: number of favoritestesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web. Other typologies, automatic Facebook shares, custom tweets without link to an article, etc. With the same fields as tesis_hometimeline.tesis_posts: data of articles published by the web and processed for some analysis.stats_id: Analysis IDpost_id: Article ID in WordPresspost_date: article publication date in WordPresspost_title: title of the articlepath: URL of the article in the middle webtags: Tags ID or WordPress tags related to the articleuniquepageviews: unique page viewsentrancerate: input ratioavgtimeonpage: average visit timeexitrate: output ratiopageviewspersession: page views per sessionadsense_adunitsviewed: number of ads viewed by usersadsense_viewableimpressionpercent: ad display ratioadsense_ctr: ad click ratioadsense_ecpm: estimated ad revenue per 1000 page viewstesis_stats: data from a particular analysis, performed at each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.id: ID of the analysisphase: phase of the thesis in which analysis has been carried out (right now all are 1)time: "0" if at the time of publication, "1" if 14 days laterstart_date: date and time of measurement on the day of publicationend_date: date and time when the measurement is made 14 days latermain_post_id: ID of the published article to be analysedmain_post_theme: Main section of the published article to analyzesuperheroes_theme: "1" if about superheroes, "0" if nottrailer_theme: "1" if trailer, "0" if notname: empty field, possibility to add a custom name manuallynotes: empty field, possibility to add personalized notes manually, as if some tag has been removed manually for being considered too generic, despite the fact that the editor put itnum_articles: number of articles analysednum_articles_with_traffic: number of articles analysed with traffic (which will be taken into account for traffic analysis)num_articles_with_tw_data: number of articles with data from when they were shared on the media’s Twitter accountnum_terms: number of terms analyzeduniquepageviews_total: total page viewsuniquepageviews_mean: average page viewsentrancerate_mean: average input ratioavgtimeonpage_mean: average duration of visitsexitrate_mean: average output ratiopageviewspersession_mean: average page views per sessiontotal: total of ads viewedadsense_adunitsviewed_mean: average of ads viewedadsense_viewableimpressionpercent_mean: average ad display ratioadsense_ctr_mean: average ad click ratioadsense_ecpm_mean: estimated ad revenue per 1000 page viewsTotal: total incomeretweet_count_mean: average incomefavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesterms_ini_num_tweets: total tweets on the terms on the day of publicationterms_ini_retweet_count_total: total retweets on the terms on the day of publicationterms_ini_retweet_count_mean: average retweets on the terms on the day of publicationterms_ini_favorite_count_total: total of favorites on the terms on the day of publicationterms_ini_favorite_count_mean: average of favorites on the terms on the day of publicationterms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publicationterms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms on the day of publicationterms_end_num_tweets: total tweets on terms 14 days after publicationterms_ini_retweet_count_total: total retweets on terms 14 days after publicationterms_ini_retweet_count_mean: average retweets on terms 14 days after publicationterms_ini_favorite_count_total: total bookmarks on terms 14 days after publicationterms_ini_favorite_count_mean: average of favorites on terms 14 days after publicationterms_ini_followers_talking_rate: ratio of media Twitter account followers who have recently posted a tweet talking about the terms 14 days after publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the terms 14 days after publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publicationterms_ini_user_age_mean: the average age in days of users who have spoken of the terms 14 days after publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms 14 days after publication.tesis_terms: data of the terms (tags) related to the processed articles.stats_id: Analysis IDtime: "0" if at the time of publication, "1" if 14 days laterterm_id: Term ID (tag) in WordPressname: Name of the termslug: URL of the termnum_tweets: number of tweetsretweet_count_total: total retweetsretweet_count_mean: average retweetsfavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesfollowers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the termuser_num_followers_mean: average followers of users who were talking about the termuser_num_tweets_mean: average number of tweets published by users who were talking about the termuser_age_mean: average age in days of users who were talking about the termurl_inclusion_rate: URL inclusion ratio

  8. E-commerce - Users of a French C2C fashion store

    • kaggle.com
    Updated Feb 24, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jeffrey Mvutu Mabilama (2024). E-commerce - Users of a French C2C fashion store [Dataset]. https://www.kaggle.com/jmmvutu/ecommerce-users-of-a-french-c2c-fashion-store/notebooks
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 24, 2024
    Dataset provided by
    Kaggle
    Authors
    Jeffrey Mvutu Mabilama
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    Foreword

    This users dataset is a preview of a much bigger dataset, with lots of related data (product listings of sellers, comments on listed products, etc...).

    My Telegram bot will answer your queries and allow you to contact me.

    Context

    There are a lot of unknowns when running an E-commerce store, even when you have analytics to guide your decisions.

    Users are an important factor in an e-commerce business. This is especially true in a C2C-oriented store, since they are both the suppliers (by uploading their products) AND the customers (by purchasing other user's articles).

    This dataset aims to serve as a benchmark for an e-commerce fashion store. Using this dataset, you may want to try and understand what you can expect of your users and determine in advance how your grows may be.

    • For instance, if you see that most of your users are not very active, you may look into this dataset to compare your store's performance.

    If you think this kind of dataset may be useful or if you liked it, don't forget to show your support or appreciation with an upvote/comment. You may even include how you think this dataset might be of use to you. This way, I will be more aware of specific needs and be able to adapt my datasets to suits more your needs.

    This dataset is part of a preview of a much larger dataset. Please contact me for more.

    Content

    The data was scraped from a successful online C2C fashion store with over 10M registered users. The store was first launched in Europe around 2009 then expanded worldwide.

    Visitors vs Users: Visitors do not appear in this dataset. Only registered users are included. "Visitors" cannot purchase an article but can view the catalog.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Questions you might want to answer using this dataset:

    • Are e-commerce users interested in social network feature ?
    • Are my users active enough (compared to those of this dataset) ?
    • How likely are people from other countries to sign up in a C2C website ?
    • How many users are likely to drop off after years of using my service ?

    Example works:

    • Report(s) made using SQL queries can be found on the data.world page of the dataset.
    • Notebooks may be found on the Kaggle page of the dataset.

    License

    CC-BY-NC-SA 4.0

    For other licensing options, contact me.

  9. Google Analytics Sample

    • console.cloud.google.com
    Updated Jul 15, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Obfuscated%20Google%20Analytics%20360%20data&hl=pl&inv=1&invt=Ab3yJQ (2017). Google Analytics Sample [Dataset]. https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data?hl=pl
    Explore at:
    Dataset updated
    Jul 15, 2017
    Dataset provided by
    Googlehttp://google.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store , a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data Learn more about the data The data includes The data is typical of what an ecommerce website would see and includes the following information:Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display trafficContent data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions on the Google Merchandise Store website.Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated such as fullVisitorId, or removed such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery

  10. Uplift Modeling , Marketing Campaign Data

    • kaggle.com
    zip
    Updated Nov 1, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Möbius (2020). Uplift Modeling , Marketing Campaign Data [Dataset]. https://www.kaggle.com/arashnic/uplift-modeling
    Explore at:
    zip(340156703 bytes)Available download formats
    Dataset updated
    Nov 1, 2020
    Authors
    Möbius
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Uplift modeling is an important yet novel area of research in machine learning which aims to explain and to estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads and uplift modeling is used to direct marketing efforts towards users for whom it is the most efficient . The data is a collection collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a healthy 590x factor.

    ###
    ###

    Content

    The dataset was created by The Criteo AI Lab .The dataset consists of 13M rows, each one representing a user with 12 features, a treatment indicator and 2 binary labels (visits and conversions). Positive labels mean the user visited/converted on the advertiser website during the test period (2 weeks). The global treatment ratio is 84.6%. It is usual that advertisers keep only a small control population as it costs them in potential revenue.

    Following is a detailed description of the features:

    • f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
    • treatment: treatment group (1 = treated, 0 = control)
    • conversion: whether a conversion occured for this user (binary, label)
    • visit: whether a visit occured for this user (binary, label)
    • exposure: treatment effect, whether the user has been effectively exposed (binary)

    ###

    Context

    Uplift modeling is an important yet novel area of research in machine learning which aims to explain and to estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads and uplift modeling is used to direct marketing efforts towards users for whom it is the most efficient . The data is a collection collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a healthy 590x factor.

    ###
    ###

    Content

    The dataset was created by The Criteo AI Lab .The dataset consists of 13M rows, each one representing a user with 12 features, a treatment indicator and 2 binary labels (visits and conversions). Positive labels mean the user visited/converted on the advertiser website during the test period (2 weeks). The global treatment ratio is 84.6%. It is usual that advertisers keep only a small control population as it costs them in potential revenue.

    Following is a detailed description of the features:

    • f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
    • treatment: treatment group (1 = treated, 0 = control)
    • conversion: whether a conversion occured for this user (binary, label)
    • visit: whether a visit occured for this user (binary, label)
    • exposure: treatment effect, whether the user has been effectively exposed (binary)

    ###

    Starter Kernels

    Acknowledgement

    The data provided for paper: "A Large Scale Benchmark for Uplift Modeling"

    https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/large-scale-benchmark.pdf

    • Eustache Diemert CAIL e.diemert@criteo.com
    • Artem Betlei CAIL & Université Grenoble Alpes a.betlei@criteo.com
    • Christophe Renaudin CAIL c.renaudin@criteo.com
    • Massih-Reza Amini Université Grenoble Alpes massih-reza.amini@imag.fr

    For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.

    Inspiration

    We can foresee related usages such as but not limited to:

    • Uplift modeling
    • Interactions between features and treatment
    • Heterogeneity of treatment

    More Readings

    MORE DATASETs ...

  11. Multilingual Scraper of Privacy Policies and Terms of Service

    • zenodo.org
    bin, zip
    Updated Apr 24, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    David Bernhard; David Bernhard; Luka Nenadic; Luka Nenadic; Stefan Bechtold; Karel Kubicek; Karel Kubicek; Stefan Bechtold (2025). Multilingual Scraper of Privacy Policies and Terms of Service [Dataset]. http://doi.org/10.5281/zenodo.14562039
    Explore at:
    zip, binAvailable download formats
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    David Bernhard; David Bernhard; Luka Nenadic; Luka Nenadic; Stefan Bechtold; Karel Kubicek; Karel Kubicek; Stefan Bechtold
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Multilingual Scraper of Privacy Policies and Terms of Service: Scraped Documents of 2024

    This dataset supplements publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW’25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites, see concrete numbers below.

    The following table lists the amount of websites visited per month:

    MonthNumber of websites
    2024-01551'148
    2024-02792'921
    2024-03844'537
    2024-04802'169
    2024-05805'878
    2024-06809'518
    2024-07811'418
    2024-08813'534
    2024-09814'321
    2024-10817'586
    2024-11828'662
    2024-12827'101

    The amount of websites visited should always be higher than the number of jobs (Table 1 of the paper) as a website may redirect, resulting in two websites scraped or it has to be retried.

    To simplify the access, we release the data in large CSVs. Namely, there is one file for policies and another for terms per month. All of these files contain all metadata that are usable for the analysis. If your favourite CSV parser reports the same numbers as above then our dataset is correctly parsed. We use ‘,’ as a separator, the first row is the heading and strings are in quotes.

    Since our scraper sometimes collects other documents than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication) that might contain personal data such as addresses of authors of websites that they maintain only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio. Presidio substitutes personal data with tokens. If your personal data has not been effectively anonymized from the database and you wish for it to be deleted, please contact us.

    Preliminaries

    The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. This also means that you likely cannot process all the data at once in your memory, so we split the data in months and in files for policies and terms.

    Files and structure

    The files have the following names:

    • 2024_policy.csv for policies
    • 2024_terms.csv for terms

    Shared metadata

    Both files contain the following metadata columns:

    • website_month_id - identification of crawled website
    • job_id - one website can have multiple jobs in case of redirects (but most commonly has only one)
    • website_index_status - network state of loading the index page. This is resolved by the Chromed DevTools Protocol.
      • DNS_ERROR - domain cannot be resolved
      • OK - all fine
      • REDIRECT - domain redirect to somewhere else
      • TIMEOUT - the request timed out
      • BAD_CONTENT_TYPE - 415 Unsupported Media Type
      • HTTP_ERROR - 404 error
      • TCP_ERROR - error in the network connection
      • UNKNOWN_ERROR - unknown error
    • website_lang - language of index page detected based on langdetect library
    • website_url - the URL of the website sampled from the CrUX list (may contain subdomains, etc). Use this as a unique identifier for connecting data between months.
    • job_domain_status - indicates the status of loading the index page. Can be:
      • OK - all works well (at the moment, should be all entries)
      • BLACKLISTED - URL is on our list of blocked URLs
      • UNSAFE - website is not safe according to save browsing API by Google
      • LOCATION_BLOCKED - country is in the list of blocked countries
    • job_started_at - when the visit of the website was started
    • job_ended_at - when the visit of the website was ended
    • job_crux_popularity - JSON with all popularity ranks of the website this month
    • job_index_redirect - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The index_redirect is then the job.id corresponding to the redirect target.
    • job_num_starts - amount of crawlers that started this job (counts restarts in case of unsuccessful crawl, max is 3)
    • job_from_static - whether this job was included in the static selection (see Sec. 3.3 of the paper)
    • job_from_dynamic - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper) - this is not exclusive with from_static - both can be true when the lists overlap.
    • job_crawl_name - our name of the crawl, contains year and month (e.g., 'regular-2024-12' for regular crawls, in Dec 2024)

    Policy data

    • policy_url_id - ID of the URL this policy has
    • policy_keyword_score - score (higher is better) according to the crawler's keywords list that given document is a policy
    • policy_ml_probability - probability assigned by the BERT model that given document is a policy
    • policy_consideration_basis - on which basis we decided that this url is policy. The following three options are executed by the crawler in this order:
      1. 'keyword matching' - this policy was found using the crawler navigation (which is based on keywords)
      2. 'search' - this policy was found using search engine
      3. 'path guessing' - this policy was found by using well-known URLs like example.com/policy
    • policy_url - full URL to the policy
    • policy_content_hash - used as identifier - if the document remained the same between crawls, it won't create a new entry
    • policy_content - contains the text of policies and terms extracted to Markdown using Mozilla's readability library
    • policy_lang - Language detected by fasttext of the content

    Terms data

    Analogous to policy data, just substitute policy to terms.

    Updates

    Check this Google Docs for an updated version of this README.md.

  12. Lead Scoring Dataset

    • kaggle.com
    zip
    Updated Aug 17, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Amrita Chatterjee (2020). Lead Scoring Dataset [Dataset]. https://www.kaggle.com/amritachatterjee09/lead-scoring-dataset
    Explore at:
    zip(411028 bytes)Available download formats
    Dataset updated
    Aug 17, 2020
    Authors
    Amrita Chatterjee
    Description

    Context

    An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

    The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses or fill up a form for the course or watch some videos. When these people fill up a form providing their email address or phone number, they are classified to be a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X education is around 30%.

    Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if, say, they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most potential leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone.

    There are a lot of leads generated in the initial stage (top) but only a few of them come out as paying customers from the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc. ) in order to get a higher lead conversion.

    X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model wherein you need to assign a lead score to each of the leads such that the customers with higher lead score h have a higher conversion chance and the customers with lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.

    Content

    Variables Description * Prospect ID - A unique ID with which the customer is identified. * Lead Number - A lead number assigned to each lead procured. * Lead Origin - The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc. * Lead Source - The source of the lead. Includes Google, Organic Search, Olark Chat, etc. * Do Not Email -An indicator variable selected by the customer wherein they select whether of not they want to be emailed about the course or not. * Do Not Call - An indicator variable selected by the customer wherein they select whether of not they want to be called about the course or not. * Converted - The target variable. Indicates whether a lead has been successfully converted or not. * TotalVisits - The total number of visits made by the customer on the website. * Total Time Spent on Website - The total time spent by the customer on the website. * Page Views Per Visit - Average number of pages on the website viewed during the visits. * Last Activity - Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc. * Country - The country of the customer. * Specialization - The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form. * How did you hear about X Education - The source from which the customer heard about X Education. * What is your current occupation - Indicates whether the customer is a student, umemployed or employed. * What matters most to you in choosing this course An option selected by the customer - indicating what is their main motto behind doing this course. * Search - Indicating whether the customer had seen the ad in any of the listed items. * Magazine
    * Newspaper Article * X Education Forums
    * Newspaper * Digital Advertisement * Through Recommendations - Indicates whether the customer came in through recommendations. * Receive More Updates About Our Courses - Indicates whether the customer chose to receive more updates about the courses. * Tags - Tags assigned to customers indicating the current status of the lead. * Lead Quality - Indicates the quality of lead based on the data and intuition the employee who has been assigned to the lead. * Update me on Supply Chain Content - Indicates whether the customer wants updates on the Supply Chain Content. * Get updates on DM Content - Indicates whether the customer wants updates on the DM Content. * Lead Profile - A lead level assigned to each customer based on their profile. * City - The city of the customer. * Asymmetric Activity Index - An index and score assigned to each customer based on their activity and their profile * Asymmetric Profile Index * Asymmetric Activity Score * Asymmetric Profile Score
    * I agree to pay the amount through cheque - Indicates whether the customer has agreed to pay the amount through cheque or not. * a free copy of Mastering The Interview - Indicates whether the customer wants a free copy of 'Mastering the Interview' or not. * Last Notable Activity - The last notable activity performed by the student.

    Acknowledgements

    UpGrad Case Study

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  13. d

    Statcan Dialogue Dataset

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lu, Xing Han; Reddy, Siva; de Vries, Harm (2023). Statcan Dialogue Dataset [Dataset]. http://doi.org/10.5683/SP3/NR0BMY
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Lu, Xing Han; Reddy, Siva; de Vries, Harm
    Description

    Welcome to the data repository for requesting access to the Statcan Dialogue Dataset! Before requesting access, you can visit our website or read our EACL 2023 paper Requesting Access In order to use our dataset, you must agree to the terms of use and restrictions before requesting access (see below). We will manually review each request and grant access or reach out to you for further information. To facilitate the process, make sure that: Your Dataverse account is linked to your professional/research website, which we may review to ensure the dataset will be used for the intended purpose Your request is made with an academic (e.g. .edu) or professional email (e.g. @servicenow.com). To do this, your have to set your primary email to your academic/professional email, or create a new Dataverse account. If your academic institution does not end with .edu, or you are part of a professional group that does not have an email address, please contact us (see email in paper). Abstract: We introduce the StatCan Dialogue Dataset consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in English or French, and lead to agents retrieving one of over 5000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on a on-going conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations, as we observe a significant drop in performance across both tasks when we move from the validation to the test set. In addition, we find that response generation models struggle to decide when to return a table. Considering that the tasks pose significant challenges to existing models, we encourage the community to develop models for our task, which can be directly used to help knowledge workers find relevant tables for live chat users.

  14. P

    How do I file a dispute with Expedia?#Customer-Service Dataset

    • paperswithcode.com
    Updated Jun 28, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). How do I file a dispute with Expedia?#Customer-Service Dataset [Dataset]. https://paperswithcode.com/dataset/how-do-i-file-a-dispute-with-expedia-customer
    Explore at:
    Dataset updated
    Jun 28, 2025
    Description

    To file a dispute with Expedia, contact their customer support at+1-888-829-0881 or +1-805-330-4056 Clearly explain your issue with all booking details. If unresolved, escalate to a supervisor, use online chat or email, and consider filing a complaint with the BBB or disputing the charge with your bank. Who do I complain to about Expedia?? Who do I complain to about Expedia?, first review your booking details under “My Trips.” Contact Expedia customer service at +1-888-829-0881 or +1-805-330-4056 to explain the issue. If unresolved, request escalation or file a chargeback with your bank or credit card provider as a final step. How do I communicate with Expedia? To communicate with Expedia, you can call their customer service at +1-888-829-0881 or +1-805-330-4056. Alternatively, use their online chat support via the Expedia website or mobile app. For unresolved issues, consider reaching out through social media or emailing their customer service team directly for further assistance. Commissions How do I lodge a complaint against Expedia? To lodge a complaint against Expedia, start by contacting their customer support at +1-888-829-0881 or +1-805-330-4056If unresolved, escalate the issue by requesting a supervisor or reaching out via their official social media channels. For further action, consider filing a complaint with the Better Business Bureau (BBB) or the Federal Trade Commission (FTC). How do I dispute a charge on Expedia? To dispute a charge on Expedia, contact their customer service at +1-888-829-0881 or +1-805-330-4056. UK . A representative will assist you in reviewing the charge and guide you through the dispute process. Make sure to have your booking details ready for quicker resolution.” Who do I report Expedia to? If you need to report Expedia, start by contacting their customer support at +1-888-829-0881 or +1-805-330-4056. If the issue remains unresolved, you can file a complaint with the Better Business Bureau (BBB) through their website. Additionally, consider submitting a report to the Federal Trade Commission (FTC) via their official website. Publications How do I get to an agent on Expedia? To speak with a human representative at Expedia, you can call our customer service directly. Dial "+1-888-829-0881 or +1-805-330-4056 (Live Human)" (Quick connect) or 1-805-Expedia-LINE, follow the IVR instructions, or hold the line to be connected with a live agent. How do I file a claim against Expedia? To file a claim against Expedia, log in to your account and locate the booking in “My Trips.” Use their Help Center or contact customer support via chat or phone at+1-888-829-0881 or +1-805-330-4056 If unresolved, escalate the issue to a supervisor or submit a complaint to consumer protection agencies. How do I ask a question at Expedia? To ask a question at Expedia, visit their website and go to the Help Center. You can use the live chat feature or call customer support at +1-888-829-0881 or +1-805-330-4056. Alternatively, you can also email Expedia or reach out via their social media channels for assistance.

  15. d

    Vision Zero Benchmarking

    • catalog.data.gov
    • data.sfgov.org
    • +2more
    Updated Mar 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.sfgov.org (2025). Vision Zero Benchmarking [Dataset]. https://catalog.data.gov/dataset/vision-zero-benchmarking
    Explore at:
    Dataset updated
    Mar 29, 2025
    Dataset provided by
    data.sfgov.org
    Description

    A. SUMMARY This dataset contains the underlying data for the Vision Zero Benchmarking website. Vision Zero is the collaborative, citywide effort to end traffic fatalities in San Francisco. The goal of this benchmarking effort is to provide context to San Francisco’s work and progress on key Vision Zero metrics alongside its peers. The Controller's Office City Performance team collaborated with the San Francisco Municipal Transportation Agency, the San Francisco Department of Public Health, the San Francisco Police Department, and other stakeholders on this project. B. HOW THE DATASET IS CREATED The Vision Zero Benchmarking website has seven major metrics. The City Performance team collected the data for each metric separately, cleaned it, and visualized it on the website. This dataset has all seven metrics and some additional underlying data. The majority of the data is available through public sources, but a few data points came from the peer cities themselves. C. UPDATE PROCESS This dataset is for historical purposes only and will not be updated. To explore more recent data, visit the source website for the relevant metrics. D. HOW TO USE THIS DATASET This dataset contains all of the Vision Zero Benchmarking metrics. Filter for the metric of interest, then explore the data. Where applicable, datasets already include a total. For example, under the Fatalities metric, the "Total Fatalities" category within the metric shows the total fatalities in that city. Any calculations should be reviewed to not double-count data with this total. E. RELATED DATASETS N/A

  16. e

    Evaluating Website Quality - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Sep 10, 2013
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2013). Evaluating Website Quality - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/de050c84-6c90-552f-9c59-af1de6fb01ad
    Explore at:
    Dataset updated
    Sep 10, 2013
    Description

    Dit bestand bevat de data die zijn verzameld in het kader van het proefschrift van Sanne Elling: ‘Evaluating website quality: Five studies on user-focused evaluation methods’.Summary:The benefits of evaluating websites among potential users are widely acknowledged. There are several methods that can be used to evaluate the websites’ quality from a users’ perspective. In current practice, many evaluations are executed with inadequate methods that lack research-based validation. This thesis aims to gain more insight into evaluation methodology and to contribute to a higher standard of website evaluation in practice. A first way to evaluate website quality is measuring the users’ opinions. This is often done with questionnaires, which gather opinions in a cheap, fast, and easy way. However, many questionnaires seem to miss a solid statistical basis and a justification of the choice of quality dimensions and questions. We therefore developed the ‘Website Evaluation Questionnaire’ (WEQ), which was specifically designed for the evaluation of governmental websites. In a study in online and laboratory settings the WEQ has proved to be a valid and reliable instrument. A way to gather more specific user opinions, is inviting participants to review website pages. Participants provide their comments by clicking on a feedback button, marking a problematic segment, and formulating their feedback.There has been debate about the extent to which users are able to provide relevant feedback. The results of our studies showed that participants were able to provide useful feedback. They signalled many relevant problems that indeed were experienced by users who needed to find information on the website. Website quality can also be measured during participants’ task performance. A frequently used method is the concurrent think-aloud method (CTA), which involves participants who verbalize their thoughts while performing tasks. There have been doubts on the usefulness and exhaustiveness of participants’ verbalizations. Therefore, we have combined CTA and eye tracking in order to examine the cognitive processes that participants do and do not verbalize. The results showed that the participants’ verbalizations provided substantial information in addition to the directly observable user problems. There was also a rather high percentage of silences (27%) during which interesting observations could be made about the users’ processes and obstacles. A thorough evaluation should therefore combine verbalizations and (eye tracking) observations. In a retrospective think-aloud (RTA) evaluation participants verbalize their thoughts afterwards while watching a recording of their performance. A problem with RTA is that participants not always remember the thoughts they had during their task performance. We therefore complemented the dynamic screen replay of their actions (pages visited and mouse movements) with a dynamic gaze replay of the participants’ eye movements.Contrary to our expectations, no differences were found between the two conditions. It is not possible to draw conclusions on the single best method. The value of a specific method is strongly influenced by the goals and context of an evaluation. Also, the outcomes of the evaluation not only depend on the method, but also on other choices during the evaluation, such as participant selection, tasks, and the subsequent analysis.

  17. P

    @##How can I get my flight confirmation number? Dataset

    • paperswithcode.com
    Updated Jun 28, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). @##How can I get my flight confirmation number? Dataset [Dataset]. https://paperswithcode.com/dataset/how-can-i-get-my-flight-confirmation-number-3
    Explore at:
    Dataset updated
    Jun 28, 2025
    Description

    How can I get my flight confirmation number? Your Delta flight confirmation number is sent via email after booking. If you didn’t get it, call ☎️+1 (877) 443-8285 immediately. This number is usually a six-character alphanumeric code. You’ll find it in your email subject line or ticket receipt. Double-check your spam folder too. If you booked through a travel agent, they might provide the number instead. Always save a screenshot or printout for quick access. If you've lost it, visit the Delta website and select “Find My Trip.” Enter your name and card used for payment or call ☎️+1 (877) 443-8285 to retrieve it. Your confirmation number is crucial for managing your flight, checking in, or making any changes. Without it, airline staff may not be able to locate your booking. Booking through Delta’s app also stores this automatically under “My Trips.” Make sure to note it down during your purchase. If you still can't locate it, don’t panic—☎️+1 (877) 443-8285 support is available 24/7 to help. Agents can find your reservation using your email, phone number, or travel dates. Always confirm your flight and keep the confirmation number safe until your trip is completed. How to find itinerary on Delta Airlines? To find your itinerary on Delta Airlines, go to “My Trips” on their official website. For help, call ☎️+1 (877) 443-8285 anytime. You’ll need your last name and flight confirmation number. Once entered, your full travel itinerary will be displayed, including departure time, gate, layovers, and seat assignment. You can also access this via the Fly Delta app, which syncs your itinerary automatically. If you booked through a third party, check their platform too, or confirm directly with Delta by calling ☎️+1 (877) 443-8285. Always verify your travel plans at least 24–48 hours in advance to avoid surprises. Your itinerary is your official travel schedule, so it's vital to check for any updates, delays, or terminal changes. If you’re unsure how to navigate the site, their agents are happy to assist. ☎️+1 (877) 443-8285 is your go-to support line if anything seems off. Printing or downloading your itinerary is recommended for easy access during your trip. A well-reviewed itinerary ensures smoother airport check-in, boarding, and connections. Keep a digital and physical copy in case your phone battery dies. Staying updated with your itinerary helps avoid confusion and travel issues.

  18. w

    Calderdale Data Works Webstats

    • data.wu.ac.at
    csv, html, pdf, xls
    Updated Aug 24, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Calderdale Council (2018). Calderdale Data Works Webstats [Dataset]. https://data.wu.ac.at/odso/data_gov_uk/OThiNDdlZjEtZDAwOS00NDg5LTk3ZTEtYTUzODhiNWZiMDQ4
    Explore at:
    csv, pdf(102759.0), xls, html, pdfAvailable download formats
    Dataset updated
    Aug 24, 2018
    Dataset provided by
    Calderdale Council
    License

    Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Area covered
    Calderdale
    Description

    Views, visits and downloads of the datasets published on our Data Works platform. Please note, each file uploaded on the site is listed and this may include files now deleted or changed.

    This dataset has information from data.gov.uk to show the number of downloads for each file and from Google Analytics to show the number of visits, unique visits, time spent, bounce rate and exit rate for each dataset.

    Our Open Data is also published on data.gov.uk and you can see statistics for this site here - Data.Gov.UK - Calderdale

  19. Average daily time spent on social media worldwide 2012-2025

    • statista.com
    Updated Jun 19, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Average daily time spent on social media worldwide 2012-2025 [Dataset]. https://www.statista.com/statistics/433871/daily-social-media-usage-worldwide/
    Explore at:
    Dataset updated
    Jun 19, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    Worldwide
    Description

    How much time do people spend on social media? As of 2025, the average daily social media usage of internet users worldwide amounted to 141 minutes per day, down from 143 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of 3 hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just 2 hours and 16 minutes. Global social media usageCurrently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events friends. Global impact of social mediaSocial media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased a polarization in politics and heightened everyday distractions.

  20. Z

    Data from: A Benchmark Suite for Systematically Evaluating Reasoning...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Emanuele, Marconato (2024). A Benchmark Suite for Systematically Evaluating Reasoning Shortcuts [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11612555
    Explore at:
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    Stefano, Teso
    Antonio, Vergari
    Emanuele, Marconato
    Andrea, Passerini
    Tommaso, Carraro
    Emile, van Krieken
    Paolo, Morettin
    Samuele, Bortolotti
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Codebase [Github] | Dataset [Zenodo]

    Abstract

    The advent of powerful neural classifiers has increased interest in problems that require both learning and reasoning. These problems are critical for understanding important properties of models, such as trustworthiness, generalization, interpretability, and compliance to safety and structural constraints. However, recent research observed that tasks requiring both learning and reasoning on background knowledge often suffer from reasoning shortcuts (RSs): predictors can solve the downstream reasoning task without associating the correct concepts to the high-dimensional data. To address this issue, we introduce rsbench, a comprehensive benchmark suite designed to systematically evaluate the impact of RSs on models by providing easy access to highly customizable tasks affected by RSs. Furthermore, rsbench implements common metrics for evaluating concept quality and introduces novel formal verification procedures for assessing the presence of RSs in learning tasks. Using rsbench, we highlight that obtaining high quality concepts in both purely neural and neuro-symbolic models is a far-from-solved problem. rsbench is available on Github.

    Usage

    We recommend visiting the official code website for instructions on how to use the dataset and accompaying software code.

    License

    All ready-made data sets and generated datasets are distributed under the CC-BY-SA 4.0 license, with the exception of Kand-Logic, which is derived from Kandinsky-patterns and as such is distributed under the GPL-3.0 license.

    Datasets Overview

    CLIP-embeddings. This folder contains the saved activations from a pretrained CLIP model applied to the tested dataset. It includes embeddings that represent the dataset in a format suitable for further analysis and experimentation.

    BDD_OIA-original-dataset. This directory holds the original files from the X-OIA project by Xu et al. [1]. These datasets have been made publicly available for ease of access and further research. If you are going to use it, please consider citing the original authors.

    kand-logic-3k. This folder contains all images generated for the Kand-Logic project. Each image is accompanied by annotations for both concepts and labels.

    bbox-kand-logic-3k. In this directory, you will find images from the Kand-Logic project that have undergone a preprocessing step. These images are extracted based on bounding boxes, rescaled, and include annotations for concepts and labels.

    sdd-oia. This folder includes all images and labels generated using rsbench.

    sdd-oia-embeddings. This directory contains 512-dimensional embeddings extracted from a pretrained ResNet18 model on ImageNet. The embeddings are derived from the sdd-oia`dataset.

    BDD-OIA-preprocessed. Here you will find preprocessed data that follow the methodology outlined by Sawada and Nakamura [2]. The folder contains 2048-dimensional embeddings extracted from a pretrained Faster-RCNN model on the BDD-100k dataset.

    The original BDD datasets can be downloaded from the following Google Drive link: [Download BDD Dataset].

    References

    [1] Xu et al., Explainable Object-Induced Action Decision for Autonomous Vehicles, CVPR 2020.

    [2] Sawada and Nakamura, Concept Bottleneck Model With Additional Unsupervised Concepts, IEEE 2022.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Bob Nau (2020). Daily website visitors (time series regression) [Dataset]. https://www.kaggle.com/bobnau/daily-website-visitors/code
Organization logo

Daily website visitors (time series regression)

Predict tomorrow's number of website visitors from 5 years of daily data

Explore at:
4 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 20, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bob Nau
Description

Context

This file contains 5 years of daily time series data for several measures of traffic on a statistical forecasting teaching notes website whose alias is statforecasting.com. The variables have complex seasonality that is keyed to the day of the week and to the academic calendar. The patterns you you see here are similar in principle to what you would see in other daily data with day-of-week and time-of-year effects. Some good exercises are to develop a 1-day-ahead forecasting model, a 7-day ahead forecasting model, and an entire-next-week forecasting model (i.e., next 7 days) for unique visitors.

Content

The variables are daily counts of page loads, unique visitors, first-time visitors, and returning visitors to an academic teaching notes website. There are 2167 rows of data spanning the date range from September 14, 2014, to August 19, 2020. A visit is defined as a stream of hits on one or more pages on the site on a given day by the same user, as identified by IP address. Multiple individuals with a shared IP address (e.g., in a computer lab) are considered as a single user, so real users may be undercounted to some extent. A visit is classified as "unique" if a hit from the same IP address has not come within the last 6 hours. Returning visitors are identified by cookies if those are accepted. All others are classified as first-time visitors, so the count of unique visitors is the sum of the counts of returning and first-time visitors by definition. The data was collected through a traffic monitoring service known as StatCounter.

Inspiration

This file and a number of other sample datasets can also be found on the website of RegressIt, a free Excel add-in for linear and logistic regression which I originally developed for use in the course whose website generated the traffic data given here. If you use Excel to some extent as well as Python or R, you might want to try it out on this dataset.

Search
Clear search
Close search
Google apps
Main menu