48 datasets found
  1. Traces captured by visiting the top 1500 website

    • kaggle.com
    zip
    Updated Aug 25, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    DNS_dataset (2021). Traces captured by visiting the top 1500 website [Dataset]. https://www.kaggle.com/jacksontang16/traces-captured-by-visiting-the-top-1500-website
    Explore at:
    zip(5852806 bytes)Available download formats
    Dataset updated
    Aug 25, 2021
    Authors
    DNS_dataset
    Description

    Dataset

    This dataset was created by DNS_dataset

    Contents

  2. Most visited websites by hierachycal categories

    • kaggle.com
    Updated Sep 18, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Natanael de Souza Figueiredo (2020). Most visited websites by hierachycal categories [Dataset]. https://www.kaggle.com/natanael127/most-visited-websites-by-hierachycal-categories/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 18, 2020
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Natanael de Souza Figueiredo
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)

    The categories list was going out by September, 17h, 2020. So I would like to save it. https://support.alexa.com/hc/en-us/articles/360051913314

    This dataset was elaborated by this python script (V2.0): https://github.com/natanael127/dump-alexa-ranking

    Content

    The sites are grouped in 17 macro categories and this tree ends having more than 360.000 nodes. Subjects are very organized and each of them has its own rank of most accessed domains. So, even the keys of a sub-dictionary may be a good small dataset to use.

    Acknowledgements

    Thank you my friend André (https://github.com/andrerclaudio) by helping me with tips of Google Colaboratory and computational power to get the data until our deadline.

    Inspiration

    Alexa ranking was inspired by Library of Alexandria. In the modern world, it may be a good start for AI know more about many, many subjects of the world.

  3. n

    (Dataset) The most visited health websites in the world

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Acosta-Vargas, P (via Mendeley Data) (2021). (Dataset) The most visited health websites in the world [Dataset]. http://doi.org/10.17632/n468trh5my.1
    Explore at:
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Acosta-Vargas, P (via Mendeley Data)
    Description

    Evaluation of the most visited health websites in the world

  4. i

    Website Fingerprinting Dataset of Browsing Network Traffic for Desktop and...

    • ieee-dataport.org
    Updated Oct 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mohamad Amar Irsyad Mohd Aminuddin (2024). Website Fingerprinting Dataset of Browsing Network Traffic for Desktop and Mobile Webpages [Dataset]. https://ieee-dataport.org/documents/website-fingerprinting-dataset-browsing-network-traffic-desktop-and-mobile-webpages
    Explore at:
    Dataset updated
    Oct 21, 2024
    Authors
    Mohamad Amar Irsyad Mohd Aminuddin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset of Tor cell file extracted from browsing simulation using Tor Browser. The simulations cover both desktop and mobile webpages. The data collection process was using WFP-Collector tool (https://github.com/irsyadpage/WFP-Collector). All the neccessary configuration to perform the simulation as detailed in the tool repository.The webpage URL is selected by using the first 100 website based on: https://dataforseo.com/free-seo-stats/top-1000-websites.Each webpage URL is visited 90 times for each deskop and mobile browsing mode.

  5. Colombia: most visited websites 2024, by unique visitors

    • statista.com
    • ai-chatbox.pro
    Updated Jun 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Colombia: most visited websites 2024, by unique visitors [Dataset]. https://www.statista.com/statistics/1409003/most-visited-websites-unique-visitors-colombia/
    Explore at:
    Dataset updated
    Jun 4, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Nov 2024
    Area covered
    Colombia
    Description

    In November 2024, Google.com was the leading website in Colombia by unique visits, with around 52.9 million single accesses to the URL during that month. YouTube.com came in second with approximately 30.9 million unique monthly visits. Facebook ranked third with 24.2 million unique monthly visits.

  6. Share of top U.S. websites ignoring user privacy preferences 2024

    • statista.com
    Updated Mar 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Share of top U.S. websites ignoring user privacy preferences 2024 [Dataset]. https://www.statista.com/statistics/1560221/us-privacy-preference-ignoring/
    Explore at:
    Dataset updated
    Mar 4, 2025
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    Sep 2024
    Area covered
    United States
    Description

    As of September 2024, 75 percent of the 100 most visited websites in the United States shared personal data with advertising 3rd parties, even when users opted out. Moreover, 70 percent of them drop advertising 3rd party cookies even when users opt out.

  7. d

    NYC.gov Web Analytics

    • catalog.data.gov
    • data.cityofnewyork.us
    • +4more
    Updated Sep 30, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.cityofnewyork.us (2022). NYC.gov Web Analytics [Dataset]. https://catalog.data.gov/dataset/nyc-gov-web-analytics
    Explore at:
    Dataset updated
    Sep 30, 2022
    Dataset provided by
    data.cityofnewyork.us
    Area covered
    New York
    Description

    Web traffic statistics for the top 2000 most visited pages on nyc.gov by month.

  8. Z

    Dataset used for detecting DNS over HTTPS by Machine Learning.

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 28, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Vekshin,Dmitrii (2020). Dataset used for detecting DNS over HTTPS by Machine Learning. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3818004
    Explore at:
    Dataset updated
    Oct 28, 2020
    Dataset provided by
    Vekshin,Dmitrii
    Cejka,Tomas
    Hynek,Karel
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The dataset consists of three different data sources:

    DoH enabled Firefox

    DoH enabled Google Chrome

    Cloudflared DoH proxy

    The capture of web browser data was made using the Selenium framework, which simulated classical user browsing. The browsers received command for visiting domains taken from Alexa's top 10K most visited websites. The capturing was performed on the host by listening to the network interface of the virtual machine. Overall the dataset contains almost 5,000 web-page visits by Mozilla and 1,000 pages visited by Chrome.

    The Cloudflared DoH proxy was installed in Raspberry PI, and the IP address of the Raspberry was set as the default DNS resolver in two separate offices in our university. It was continuously capturing the DNS/DoH traffic created up to 20 devices for around three months.

    The dataset contains 1,128,904 flows from which is around 33,000 labeled as DoH. We provide raw pcap data, CSV with flow data, and CSV file with extracted features.

    The CSV with extracted features has the following data fields:

    • Label (1 - Doh, 0 - regular HTTPS)
    • Data source
    • Duration
    • Minimal Inter-Packet Delay
    • Maximal Inter-Packet Delay
    • Average Inter-Packet Delay
    • A variance of Incoming Packet Sizes
    • A variance of Outgoing Packet Sizes
    • A ratio of the number of Incoming and outgoing bytes
    • A ration of the number of Incoming and outgoing packets
    • Average of Incoming Packet sizes
    • Average of Outgoing Packet sizes
    • The median value of Incoming Packet sizes
    • The median value of outgoing Packet sizes
    • The ratio of bursts and pauses
    • Number of bursts
    • Number of pauses
    • Autocorrelation
    • Transmission symmetry in the 1st third of connection
    • Transmission symmetry in the 2nd third of connection
    • Transmission symmetry in the last third of connection

    The observed network traffic does not contain privacy-sensitive information.

    The zip file structure is:

    |-- data | |-- extracted-features...extracted features used in ML for DoH recognition | | |-- chrome | | |-- cloudflared | | -- firefox | |-- flows...............................................exported flow data | | |-- chrome | | |-- cloudflared | |-- firefox | -- pcaps....................................................raw PCAP data | |-- chrome | |-- cloudflared |-- firefox |-- LICENSE `-- README.md

    When using this dataset, please cite the original work as follows:

    @inproceedings{vekshin2020, author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas}, title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning}, year = {2020}, isbn = {9781450388337}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3407023.3409192}, doi = {10.1145/3407023.3409192}, booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security}, articleno = {87}, numpages = {8}, keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets}, location = {Virtual Event, Ireland}, series = {ARES '20} }

  9. Context Ad Clicks Dataset

    • kaggle.com
    Updated Feb 9, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Möbius (2021). Context Ad Clicks Dataset [Dataset]. https://www.kaggle.com/arashnic/ctrtest/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 9, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Möbius
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The dataset generated by an E-commerce website which sells a variety of products at its online platform. The records user behaviour of its customers and stores it as a log. However, most of the times, users do not buy the products instantly and there is a time gap during which the customer might surf the internet and maybe visit competitor websites. Now, to improve sales of products, website owner has hired an Adtech company which built a system such that ads are being shown for owner products on its partner websites. If a user comes to owner website and searches for a product, and then visits these partner websites or apps, his/her previously viewed items or their similar items are shown on as an ad. If the user clicks this ad, he/she will be redirected to the owner website and might buy the product.

    The task is to predict the probability i.e. probability of user clicking the ad which is shown to them on the partner websites for the next 7 days on the basis of historical view log data, ad impression data and user data.

    Content

    You are provided with the view log of users (2018/10/15 - 2018/12/11) and the product description collected from the owner website. We also provide the training data and test data containing details for ad impressions at the partner websites(Train + Test). Train data contains the impression logs during 2018/11/15 – 2018/12/13 along with the label which specifies whether the ad is clicked or not. Your model will be evaluated on the test data which have impression logs during 2018/12/12 – 2018/12/18 without the labels. You are provided with the following files:

    • train.zip: This contains 3 files and description of each is given below:
    • train.csv
    • view_log.csv
    • item_data.csv

      • test.csv: test file contains the impressions for which the participants need to predict the click rate sample_submission.csv: This file contains the format in which you have to submit your predictions.

    Inspiration

    • Predict the probability probability of user clicking the ad which is shown to them on the partner websites for the next 7 days on the basis of historical view log data, ad impression data and user data.

    The evaluated metric could be "area under the ROC curve" between the predicted probability and the observed target.

  10. A

    ‘NYC.gov Web Analytics’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘NYC.gov Web Analytics’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-nyc-gov-web-analytics-2099/a81c8303/?iid=003-137&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Analysis of ‘NYC.gov Web Analytics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/f2b7ec11-c2ad-412c-8a63-914f40515c4d on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Web traffic statistics for the top 2000 most visited pages on nyc.gov by month.

    --- Original source retains full ownership of the source dataset ---

  11. YouTube's Channels Dataset

    • kaggle.com
    Updated Mar 31, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    HarshitHGupta (2021). YouTube's Channels Dataset [Dataset]. https://www.kaggle.com/datasets/harshithgupta/youtubes-channels-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 31, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    HarshitHGupta
    Area covered
    YouTube
    Description

    Context

    YouTube is an American online video-sharing platform headquartered in San Bruno, California. The service, created in February 2005 by three former PayPal employees—Chad Hurley, Steve Chen, and Jawed Karim—was bought by Google in November 2006 for US$1.65 billion and now operates as one of the company's subsidiaries. YouTube is the second most-visited website after Google Search, according to Alexa Internet rankings.

    YouTube allows users to upload, view, rate, share, add to playlists, report, comment on videos, and subscribe to other users. Available content includes video clips, TV show clips, music videos, short and documentary films, audio recordings, movie trailers, live streams, video blogging, short original videos, and educational videos.

    YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments, and likes). Note that they’re not the most-viewed videos overall for the calendar year”. Top performers on the YouTube trending list are music videos (such as the famously virile “Gangam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.

    This dataset is a daily record of the top trending YouTube videos.

    Note that this dataset is a structurally improved version of this dataset.

    Acknowledgements

    This dataset was collected using the YouTube API. This Description is cited in Wikipedia.

  12. o

    Artisanal mining site visits in Eastern DRC - Dataset - openAFRICA

    • open.africa
    Updated Feb 7, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2019). Artisanal mining site visits in Eastern DRC - Dataset - openAFRICA [Dataset]. https://open.africa/dataset/artisanal-mining-site-visits-in-eastern-drc
    Explore at:
    Dataset updated
    Feb 7, 2019
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Democratic Republic of the Congo
    Description

    IPIS has collected data on artisanal mining sites since 2009, and made it publicly accessible on webmaps and in analytical reports. The upgraded map presents new mining sites, bringing the total to more than 2400 sites visited as recently as December 2017. New information on the mining sites has been included. A new layer has been added displaying hundreds of roadblocks. The latest update of the map has been supported by the International Organization for Migration (IOM) in the DRC, through the USAID funded Responsible Minerals Trade (RMT) project

  13. Z

    Greek privacy policies dataset from PCI 2023 paper: "A privacy policies...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 27, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Papoutsoglou, Maria (2023). Greek privacy policies dataset from PCI 2023 paper: "A privacy policies dataset in Greek in the GDPR era" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10435880
    Explore at:
    Dataset updated
    Dec 27, 2023
    Dataset provided by
    Kapitsaki, Georgia
    Papoutsoglou, Maria
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of privacy policies in the Greek language, with policies coming from top visited websites in Greece with a privacy policy in the Greek language.

    The dataset, as well as results of its analysis are included.

    if you want to use this dataset, please cite the relevant conference publication:

    Georgia M. Kapitsaki and Maria Papoutsoglou, "A privacy policies dataset in Greek in the GDPR era, in Proceedings of the 27th Pan-Hellenic Conference on Informatics, PCI 2023.

  14. A

    Most Viewed Digital Records in City Archives Digital Repository

    • data.boston.gov
    csv
    Updated Apr 12, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archives and Record Management (2019). Most Viewed Digital Records in City Archives Digital Repository [Dataset]. https://data.boston.gov/dataset/most-viewed-digital-records-in-city-archives-digital-repository
    Explore at:
    csv(7671), csv, csv(7531), csv(7791), csv(7559), csv(7760), csv(7735), csv(7727), csv(7393)Available download formats
    Dataset updated
    Apr 12, 2019
    Dataset authored and provided by
    Archives and Record Management
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Monthly statistics for most viewed digital records in the City Archives Digital Repository.

  15. March Madness Historical DataSet (2002 to 2025)

    • kaggle.com
    Updated Apr 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jonathan Pilafas (2025). March Madness Historical DataSet (2002 to 2025) [Dataset]. https://www.kaggle.com/datasets/jonathanpilafas/2024-march-madness-statistical-analysis/discussion?sort=undefined
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 22, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Jonathan Pilafas
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This Kaggle dataset comes from an output dataset that powers my March Madness Data Analysis dashboard in Domo. - Click here to view this dashboard: Dashboard Link - Click here to view this dashboard features in a Domo blog post: Hoops, Data, and Madness: Unveiling the Ultimate NCAA Dashboard

    This dataset offers one the most robust resource you will find to discover key insights through data science and data analytics using historical NCAA Division 1 men's basketball data. This data, sourced from KenPom, goes as far back as 2002 and is updated with the latest 2025 data. This dataset is meticulously structured to provide every piece of information that I could pull from this site as an open-source tool for analysis for March Madness.

    Key features of the dataset include: - Historical Data: Provides all historical KenPom data from 2002 to 2025 from the Efficiency, Four Factors (Offense & Defense), Point Distribution, Height/Experience, and Misc. Team Stats endpoints from KenPom's website. Please note that the Height/Experience data only goes as far back as 2007, but every other source contains data from 2002 onward. - Data Granularity: This dataset features an individual line item for every NCAA Division 1 men's basketball team in every season that contains every KenPom metric that you can possibly think of. This dataset has the ability to serve as a single source of truth for your March Madness analysis and provide you with the granularity necessary to perform any type of analysis you can think of. - 2025 Tournament Insights: Contains all seed and region information for the 2025 NCAA March Madness tournament. Please note that I will continually update this dataset with the seed and region information for previous tournaments as I continue to work on this dataset.

    These datasets were created by downloading the raw CSV files for each season for the various sections on KenPom's website (Efficiency, Offense, Defense, Point Distribution, Summary, Miscellaneous Team Stats, and Height). All of these raw files were uploaded to Domo and imported into a dataflow using Domo's Magic ETL. In these dataflows, all of the column headers for each of the previous seasons are standardized to the current 2025 naming structure so all of the historical data can be viewed under the exact same field names. All of these cleaned datasets are then appended together, and some additional clean up takes place before ultimately creating the intermediate (INT) datasets that are uploaded to this Kaggle dataset. Once all of the INT datasets were created, I joined all of the tables together on the team name and season so all of these different metrics can be viewed under one single view. From there, I joined an NCAAM Conference & ESPN Team Name Mapping table to add a conference field in its full length and respective acronyms they are known by as well as the team name that ESPN currently uses. Please note that this reference table is an aggregated view of all of the different conferences a team has been a part of since 2002 and the different team names that KenPom has used historically, so this mapping table is necessary to map all of the teams properly and differentiate the historical conferences from their current conferences. From there, I join a reference table that includes all of the current NCAAM coaches and their active coaching lengths because the active current coaching length typically correlates to a team's success in the March Madness tournament. I also join another reference table to include the historical post-season tournament teams in the March Madness, NIT, CBI, and CIT tournaments, and I join another reference table to differentiate the teams who were ranked in the top 12 in the AP Top 25 during week 6 of the respective NCAA season. After some additional data clean-up, all of this cleaned data exports into the "DEV _ March Madness" file that contains the consolidated view of all of this data.

    This dataset provides users with the flexibility to export data for further analysis in platforms such as Domo, Power BI, Tableau, Excel, and more. This dataset is designed for users who wish to conduct their own analysis, develop predictive models, or simply gain a deeper understanding of the intricacies that result in the excitement that Division 1 men's college basketball provides every year in March. Whether you are using this dataset for academic research, personal interest, or professional interest, I hope this dataset serves as a foundational tool for exploring the vast landscape of college basketball's most riveting and anticipated event of its season.

  16. Uplift Modeling , Marketing Campaign Data

    • kaggle.com
    zip
    Updated Nov 1, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Möbius (2020). Uplift Modeling , Marketing Campaign Data [Dataset]. https://www.kaggle.com/arashnic/uplift-modeling
    Explore at:
    zip(340156703 bytes)Available download formats
    Dataset updated
    Nov 1, 2020
    Authors
    Möbius
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Uplift modeling is an important yet novel area of research in machine learning which aims to explain and to estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads and uplift modeling is used to direct marketing efforts towards users for whom it is the most efficient . The data is a collection collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a healthy 590x factor.

    ###
    ###

    Content

    The dataset was created by The Criteo AI Lab .The dataset consists of 13M rows, each one representing a user with 12 features, a treatment indicator and 2 binary labels (visits and conversions). Positive labels mean the user visited/converted on the advertiser website during the test period (2 weeks). The global treatment ratio is 84.6%. It is usual that advertisers keep only a small control population as it costs them in potential revenue.

    Following is a detailed description of the features:

    • f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
    • treatment: treatment group (1 = treated, 0 = control)
    • conversion: whether a conversion occured for this user (binary, label)
    • visit: whether a visit occured for this user (binary, label)
    • exposure: treatment effect, whether the user has been effectively exposed (binary)

    ###

    Context

    Uplift modeling is an important yet novel area of research in machine learning which aims to explain and to estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads and uplift modeling is used to direct marketing efforts towards users for whom it is the most efficient . The data is a collection collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a healthy 590x factor.

    ###
    ###

    Content

    The dataset was created by The Criteo AI Lab .The dataset consists of 13M rows, each one representing a user with 12 features, a treatment indicator and 2 binary labels (visits and conversions). Positive labels mean the user visited/converted on the advertiser website during the test period (2 weeks). The global treatment ratio is 84.6%. It is usual that advertisers keep only a small control population as it costs them in potential revenue.

    Following is a detailed description of the features:

    • f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
    • treatment: treatment group (1 = treated, 0 = control)
    • conversion: whether a conversion occured for this user (binary, label)
    • visit: whether a visit occured for this user (binary, label)
    • exposure: treatment effect, whether the user has been effectively exposed (binary)

    ###

    Starter Kernels

    Acknowledgement

    The data provided for paper: "A Large Scale Benchmark for Uplift Modeling"

    https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/large-scale-benchmark.pdf

    • Eustache Diemert CAIL e.diemert@criteo.com
    • Artem Betlei CAIL & Université Grenoble Alpes a.betlei@criteo.com
    • Christophe Renaudin CAIL c.renaudin@criteo.com
    • Massih-Reza Amini Université Grenoble Alpes massih-reza.amini@imag.fr

    For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.

    Inspiration

    We can foresee related usages such as but not limited to:

    • Uplift modeling
    • Interactions between features and treatment
    • Heterogeneity of treatment

    More Readings

    MORE DATASETs ...

  17. TED Talks

    • kaggle.com
    • data.wu.ac.at
    zip
    Updated Sep 9, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rounak Banik (2017). TED Talks [Dataset]. https://www.kaggle.com/rounakbanik/ted-talks
    Explore at:
    zip(62193209 bytes)Available download formats
    Dataset updated
    Sep 9, 2017
    Authors
    Rounak Banik
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    These datasets contain information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 2012. The TED favorites dataset contains information about the videos, registered users have favorited. The TED Talks dataset contains information about all talks including number of views, number of comments, descriptions, speakers and titles.

    The original datasets (in the JSON format) contain all the aforementioned information and in addition, also contain all the data related to content and replies.

    Content (for the CSV files)

    TED Talks

    1. id: The ID of the talk. Has no inherent meaning. Values obtained by resetting index.
    2. url: Url pointing to the talk on http://www.ted.com
    3. title: The title of the talk
    4. description: Short description of the talk
    5. transcript: The human-made transcript in English of the talk as available on the TED website
    6. related_tags: The related tags that refer to the talk assigned by TED editorial staff
    7. related_themes: The related themes that refer to the talk assigned by TED editorial staff (as of April 2013, these themes are no longer visible on the TED website)
    8. related_videos: A list of videos that are related to the given talk, assigned by TED editorial staff
    9. ted_event: The name of the event in which the talk was presented
    10. speaker: The full name of the speaker of the talk
    11. publish_date: The day on which the talk was published
    12. film_date: The day on which the talk was filmed
    13. views: The total number of views of the talk at the time of crawling
    14. comments: The number of comments on the talk

    TED Favorites

    1. talk: Name of the talk being favorited.
    2. user: The ID of the user favoriting the talk. Has no inherent meaning. This ID is INDEPENDENT of the talk ID of the aforementioned Talks dataset.

    Acknowledgements

    The original dataset was obtained from https://www.idiap.ch/dataset/ted and was in the JSON Format. Taken verbatim from the website:

    The metadata was obtained by crawling the HTML source of the list of talks and users, as well as talk and user webpages using scripts written by Nikolaos Pappas at the Idiap Research Institute, Martigny, Switzerland. The dataset is shared under the Creative Commons license (the same as the content of the TED talks) which is stored in the COPYRIGHT file. The dataset is shared for research purposes which are explained in detail in the following papers. The dataset can be used to benchmark systems that perform two tasks, namely personalized recommendations and generic recommendations. Please check the CBMI 2013 paper for a detailed description of each task.

    1. Nikolaos Pappas, Andrei Popescu-Belis, "Combining Content with User Preferences for TED Lecture Recommendation", 11th International Workshop on Content Based Multimedia Indexing, Veszprém, Hungary, IEEE, 2013
    2. Nikolaos Pappas, Andrei Popescu-Belis, Sentiment Analysis of User Comments for One-Class Collaborative Filtering over TED Talks, 36th ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, ACM, 2013

    The datasets uploaded were used by the second paper listed above.

    The ones available here in the CSV format do not include the text data of comments. Instead, they just give you the number of comments on each talk.

    Inspiration

    I've always been fascinated by TED Talks and the immense diversity of content that it provides for free. I was also thoroughly inspired by a TED Talk that visually explored TED Talks stats and I was motivated to do the same thing, albeit on a much less grander scale.

    Some of the questions that can be answered with this dataset: 1. How is each TED Talk related to every other TED Talk? 2. Which are the most viewed and most favorited Talks of all time? Are they mostly the same? What does this tell us? 3. What kind of topics attract the maximum discussion and debate (in the form of comments)? 4. Which months are most popular among TED and TEDx chapters?

  18. Imgur Most Viral and Secret Santa

    • kaggle.com
    Updated Apr 18, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ghalib93 (2020). Imgur Most Viral and Secret Santa [Dataset]. https://www.kaggle.com/ghalib93/imgur-most-viral-and-secret-santa/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 18, 2020
    Dataset provided by
    Kaggle
    Authors
    Ghalib93
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Imgur is an image hosting and sharing website founded in 2009. It became one of the most popular websites around the world with approximately 250 million users. The website does not require registration and anyone can browse its content. However, to be able to post an account must be created. It is famous for an event that it created in 2013 where members get to register to send/receive gifts from other members on the website. The event takes place during Christmas time and people share their gifts via the website where they post pictures of the process or what they received in a specific tag. Today the data provided covers two sections that I think are important to understanding certain patterns within the Imgur community. The first is the Most Viral section and the second is the Secret Santa tag.

    I have participated twice in The Imgur secret Santa event and always found funny and interesting post from its most viral section. I would like with the help of the Kaggle community to identify trends from the data provided and maybe make a comparison between the Secret Santa data and the most viral.

    Content

    There are two Dataframes included and they are almost identical in the number of columns:

    • The first Dataframe is Imgur Most Viral posts. This contains many of the posts that were labelled as Viral by The Imgur community and team using specific algorithms to track number of likes and dislikes across multiple platforms. The posts might be videos, gifs, pictures or just text.

    • The second Dataframe is Imgur Secret Santa Tag. Secret Santa is an annual Imgur tradition where members can sign up to send gifts to and receive gifts from other members during the Christmas holiday.This contains many of the posts that were tagged with Secret Santa by the Imgur community. The posts might be videos, gifs, pictures or just text. There is a (is_viral) column in this Dataframe that is not available in the Most Viral Dataframe since all of the posts there are viral.

      Data Dictionary

      FeatureTypeDatasetDescription
      account_idobjectImgur_Viral/imgur_secret_santaUnique Account ID per member
      comment_countfloat64Imgur_Viral/imgur_secret_santaNumber of comments made in the post
      datetimefloat64Imgur_Viral/imgur_secret_santaTimeStamp containing Date and Time Details
      downsfloat64Imgur_Viral/imgur_secret_santaNumber of dislikes for the post
      favorite_countfloat64Imgur_Viral/imgur_secret_santaNumber of user that marked the post as a favourite
      idobjectImgur_Viral/imgur_secret_santaUniqe Post ID. Even if it was posted by the same member, different posts will have different IDs
      images_countfloat64Imgur_Viral/imgur_secret_santaNumber of images included in the post
      pointsfloat64Imgur_Viral/imgur_secret_santaEach post will have calculated points based on (ups - downs)
      scorefloat64Imgur_Viral/imgur_secret_santaTicket number
      tagsobjectImgur_Viral/imgur_secret_santaTags are sub albums that the post will show under
      titleobjectImgur_Viral/imgur_secret_santaTitle of the post
      upsfloat64Imgur_Viral/imgur_secret_santaNumber of likes for the post
      viewsfloat64Imgur_Viral/imgur_secret_santaNumber of people that viewed the post
      is_most_viralbooleanimgur_secret_santaIf the post is viral or not

    Acknowledgements

    I would like to thank imgur for providing an API that made collecting data easier from its website. With their help we might be able to better understand certain trends that emerge from its community

    Inspiration

    There is no problem to solve from this data, but it just a fun way to explore and learn more about programming and analyzing data. I hope you enjoy playing with the data as much as I did collecting it and browsing the website

  19. C

    Website municipality of Utrecht

    • ckan.mobidatalab.eu
    Updated Jul 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OverheidNl (2023). Website municipality of Utrecht [Dataset]. https://ckan.mobidatalab.eu/dataset/utrecht-website-gemeente-utrecht
    Explore at:
    http://publications.europa.eu/resource/authority/file-type/csv, http://publications.europa.eu/resource/authority/file-type/zip, http://publications.europa.eu/resource/authority/file-type/psdAvailable download formats
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    OverheidNl
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Utrecht
    Description

    This is a combination of the following datasets: * Icons website Municipality of Utrecht * Most visited topics website municipality of Utrecht < a name="iconen-website-gemeente-utrecht"> #### Iconen website Gemeente Utrecht For the website of the Municipality of Utrecht, 45 different icons have been developed. The datasets in PNG and PSD (Photoshop) format are shown below, and there is also a preview in which the icons are shown. Examples of available icons are: * zoning plan; * notification; * to marry; * integration. #### Most visited topics website municipality of Utrecht Overview of the topics most searched for on the website municipality of Utrecht (www.utrecht .NL). This information is presented per month and includes the following data per object: * click path on website (where subject can be found); * number of page views; * average time of visit to website. * link web site.

  20. Q

    Data for: Debating Algorithmic Fairness

    • data.qdr.syr.edu
    Updated Nov 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Melissa Hamilton; Melissa Hamilton (2023). Data for: Debating Algorithmic Fairness [Dataset]. http://doi.org/10.5064/F6JOQXNF
    Explore at:
    pdf(53179), pdf(63339), pdf(285052), pdf(103333), application/x-json-hypothesis(55745), pdf(256399), jpeg(101993), pdf(233414), pdf(536400), pdf(786428), pdf(2243113), pdf(109638), pdf(176988), pdf(59204), pdf(124046), pdf(802960), pdf(82120)Available download formats
    Dataset updated
    Nov 13, 2023
    Dataset provided by
    Qualitative Data Repository
    Authors
    Melissa Hamilton; Melissa Hamilton
    License

    https://qdr.syr.edu/policies/qdr-standard-access-conditionshttps://qdr.syr.edu/policies/qdr-standard-access-conditions

    Time period covered
    2008 - 2017
    Area covered
    United States
    Description

    This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the Publisher's Website. Data Generation The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus involves a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them as it meant that they were also easily attainable by the general public, thus extending the documents’ reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third party, open source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe’s main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe’s other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some of Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions. The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on github. ProPublica’s gathering the data directly from criminal justice officials via Freedom of Information Act requests rendered the dataset in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis. Data Analysis The qualitative enquiry used critical discourse analysis, which investigates ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and to other relevant writings by the same authors. Several more specific types of discursive strategies were of interest in attracting further critical examination: Testing claims and rationalizations that appear to serve the speaker’s self-interest Examining conclusions and determining whether sufficient evidence supported them Revealing contradictions and/or inconsistencies within the same text and intertextually Assessing strategies underlying justifications and rationalizations used to promote a party’s assertions and arguments Noticing strategic deployment of lexical phrasings, syntax, and rhetoric Judging sincerity of voice and the objective consideration of alternative perspectives Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is to uncover facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted yet the significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature. The paper could have been completed with just the critical discourse analysis. However, because one of the salient findings from it highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. Then, the availability of the same dataset used by the parties in conflict, made this opportunity more appealing. Calculating additional algorithmic equity equations would not thereby be troubled by irregularities because of diverse sample sets. New variables were created as relevant to calculate algorithmic fairness equations. In addition to using various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means. Logic of Annotation Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations. Critical discourse analysis offers a rich method...

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
DNS_dataset (2021). Traces captured by visiting the top 1500 website [Dataset]. https://www.kaggle.com/jacksontang16/traces-captured-by-visiting-the-top-1500-website
Organization logo

Traces captured by visiting the top 1500 website

Traffic captured by visiting the top 1500 most visited sites ranked by Alexa

Explore at:
zip(5852806 bytes)Available download formats
Dataset updated
Aug 25, 2021
Authors
DNS_dataset
Description

Dataset

This dataset was created by DNS_dataset

Contents

Search
Clear search
Close search
Google apps
Main menu