This dataset was created by DNS_dataset
https://creativecommons.org/publicdomain/zero/1.0/
Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)
The categories list was going away on September 17, 2020, so I wanted to preserve it. https://support.alexa.com/hc/en-us/articles/360051913314
This dataset was generated by this Python script (v2.0): https://github.com/natanael127/dump-alexa-ranking
The sites are grouped into 17 macro categories, and the resulting tree has more than 360,000 nodes. Subjects are well organized, and each has its own ranking of most-accessed domains, so even the keys of a single sub-dictionary can serve as a useful small dataset.
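As a small illustration, here is a minimal sketch of walking that tree, assuming the dump is a nested JSON dictionary whose keys are category names (the file name alexa_categories.json is hypothetical; check the script's output for the actual name and schema):

```python
# A minimal sketch, assuming the dump is a nested JSON dictionary whose keys
# are category names; "alexa_categories.json" is a hypothetical file name.
import json

def count_nodes(node):
    """Recursively count the nodes of the nested category dictionary."""
    if not isinstance(node, dict):
        return 0
    return len(node) + sum(count_nodes(child) for child in node.values())

with open("alexa_categories.json") as f:
    tree = json.load(f)

print("macro categories:", list(tree.keys()))
print("total nodes:", count_nodes(tree))
```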
Thank you to my friend André (https://github.com/andrerclaudio) for helping me with Google Colaboratory tips and the computational power to collect the data before our deadline.
The Alexa ranking was inspired by the Library of Alexandria. In the modern world, it may be a good starting point for AI to learn about many, many subjects.
Evaluation of the most visited health websites in the world
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of Tor cell files extracted from browsing simulations using the Tor Browser. The simulations cover both desktop and mobile webpages. Data collection used the WFP-Collector tool (https://github.com/irsyadpage/WFP-Collector); all configuration necessary to reproduce the simulation is detailed in the tool repository. The webpage URLs are the first 100 websites from https://dataforseo.com/free-seo-stats/top-1000-websites. Each webpage URL is visited 90 times in each of the desktop and mobile browsing modes.
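For scale, here is a tool-agnostic sketch of the visit schedule implied above; the URLs are placeholders, and the actual crawling is performed by WFP-Collector, not by this snippet:

```python
# Tool-agnostic sketch of the visit schedule: 100 URLs x 2 modes x 90 visits.
# The URLs below are placeholders for the top-100 list referenced above.
from itertools import product

urls = [f"https://example{i}.com" for i in range(100)]  # placeholder URLs
modes = ["desktop", "mobile"]
visits = range(90)

schedule = list(product(urls, modes, visits))
print(len(schedule))  # 100 * 2 * 90 = 18,000 planned page loads
```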
In November 2024, Google.com was the leading website in Colombia by unique visits, with around 52.9 million single accesses to the URL during that month. YouTube.com came in second with approximately 30.9 million unique monthly visits. Facebook ranked third with 24.2 million unique monthly visits.
As of September 2024, 75 percent of the 100 most visited websites in the United States shared personal data with third-party advertisers even when users opted out. Moreover, 70 percent of them dropped third-party advertising cookies even when users opted out.
Web traffic statistics for the top 2000 most visited pages on nyc.gov by month.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset consists of three different data sources:
DoH enabled Firefox
DoH enabled Google Chrome
Cloudflared DoH proxy
Web-browser data was captured using the Selenium framework, which simulated normal user browsing. The browsers were instructed to visit domains taken from Alexa's top 10K most-visited websites. The capture was performed on the host by listening on the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Firefox and 1,000 pages visited by Google Chrome.
The Cloudflared DoH proxy was installed on a Raspberry Pi, and the Raspberry Pi's IP address was set as the default DNS resolver in two separate offices at our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.
The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw PCAP data, a CSV with flow data, and a CSV file with extracted features.
The CSV with extracted features has the following data fields:
The observed network traffic does not contain privacy-sensitive information.
The zip file structure is:
|-- data
|   |-- extracted-features...extracted features used in ML for DoH recognition
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   |-- flows...............................................exported flow data
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   `-- pcaps....................................................raw PCAP data
|       |-- chrome
|       |-- cloudflared
|       `-- firefox
|-- LICENSE
`-- README.md
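As a hedged starting point, here is a minimal scikit-learn sketch for training a DoH detector on the extracted-feature CSVs; the file path and the "label" column name are assumptions, so consult the bundled README.md for the actual schema:

```python
# Hypothetical sketch: training a classifier on the extracted-feature CSVs.
# The path and the "label" column name are assumptions; see README.md.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/extracted-features/firefox/features.csv")  # hypothetical path
X = df.drop(columns=["label"])   # assumed label column: 1 = DoH, 0 = regular HTTPS
y = df["label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```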
When using this dataset, please cite the original work as follows:
@inproceedings{vekshin2020,
  author    = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
  title     = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
  year      = {2020},
  isbn      = {9781450388337},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3407023.3409192},
  doi       = {10.1145/3407023.3409192},
  booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
  articleno = {87},
  numpages  = {8},
  keywords  = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
  location  = {Virtual Event, Ireland},
  series    = {ARES '20}
}
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was generated by an e-commerce website that sells a variety of products on its online platform. The website records the behaviour of its users and stores it as a log. However, most of the time users do not buy a product instantly; there is a time gap during which the customer might surf the internet and perhaps visit competitor websites. To improve product sales, the website owner hired an adtech company that built a system to show ads for the owner's products on partner websites. If a user comes to the owner's website and searches for a product, and then visits these partner websites or apps, the items they previously viewed (or similar items) are shown to them as ads. If the user clicks such an ad, they are redirected to the owner's website and might buy the product.
The task is to predict the probability of a user clicking an ad shown to them on the partner websites over the next 7 days, on the basis of historical view-log data, ad-impression data, and user data.
You are provided with the view log of users (2018/10/15 - 2018/12/11) and the product descriptions collected from the owner's website. We also provide training and test data containing details of ad impressions at the partner websites (Train + Test). The train data contains the impression logs for 2018/11/15 - 2018/12/13, along with a label specifying whether the ad was clicked. Your model will be evaluated on the test data, which contains impression logs for 2018/12/12 - 2018/12/18 without labels. You are provided with the following files:
item_data.csv
The evaluation metric is the area under the ROC curve (ROC AUC) between the predicted probability and the observed target.
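For reference, computing that metric with scikit-learn on a toy example:

```python
# ROC AUC between observed click labels and model-predicted click probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 1, 0, 1]             # observed is_click labels (toy example)
y_pred = [0.1, 0.8, 0.65, 0.3, 0.9]  # predicted click probabilities

print(roc_auc_score(y_true, y_pred))  # 1.0: every click outranks every non-click
```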
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘NYC.gov Web Analytics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/f2b7ec11-c2ad-412c-8a63-914f40515c4d on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Web traffic statistics for the top 2000 most visited pages on nyc.gov by month.
--- Original source retains full ownership of the source dataset ---
YouTube is an American online video-sharing platform headquartered in San Bruno, California. The service, created in February 2005 by three former PayPal employees—Chad Hurley, Steve Chen, and Jawed Karim—was bought by Google in November 2006 for US$1.65 billion and now operates as one of the company's subsidiaries. YouTube is the second most-visited website after Google Search, according to Alexa Internet rankings.
YouTube allows users to upload, view, rate, share, add to playlists, report, comment on videos, and subscribe to other users. Available content includes video clips, TV show clips, music videos, short and documentary films, audio recordings, movie trailers, live streams, video blogging, short original videos, and educational videos.
YouTube (the world-famous video-sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, "To determine the year's top-trending videos, YouTube uses a combination of factors including measuring user interactions (number of views, shares, comments, and likes). Note that they're not the most-viewed videos overall for the calendar year". Top performers on the YouTube trending list are music videos (such as the famously viral "Gangnam Style"), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well known for.
This dataset is a daily record of the top trending YouTube videos.
Note that this dataset is a structurally improved version of this dataset.
This dataset was collected using the YouTube API. Parts of this description are cited from Wikipedia.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IPIS has collected data on artisanal mining sites since 2009 and made it publicly accessible through web maps and analytical reports. The upgraded map presents new mining sites, bringing the total to more than 2,400 sites, visited as recently as December 2017. New information on the mining sites has been included, and a new layer has been added displaying hundreds of roadblocks. The latest update of the map was supported by the International Organization for Migration (IOM) in the DRC, through the USAID-funded Responsible Minerals Trade (RMT) project.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of privacy policies in the Greek language, collected from the most-visited websites in Greece that provide a privacy policy in Greek.
The dataset, as well as the results of its analysis, is included.
If you want to use this dataset, please cite the relevant conference publication:
Georgia M. Kapitsaki and Maria Papoutsoglou, "A privacy policies dataset in Greek in the GDPR era," in Proceedings of the 27th Pan-Hellenic Conference on Informatics, PCI 2023.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Monthly statistics for most viewed digital records in the City Archives Digital Repository.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This Kaggle dataset comes from an output dataset that powers my March Madness Data Analysis dashboard in Domo. - Click here to view this dashboard: Dashboard Link - Click here to view this dashboard features in a Domo blog post: Hoops, Data, and Madness: Unveiling the Ultimate NCAA Dashboard
This dataset offers one of the most robust resources you will find for discovering key insights through data science and data analytics using historical NCAA Division 1 men's basketball data. The data, sourced from KenPom, goes back to 2002 and is updated with the latest 2025 data. It is meticulously structured to provide every piece of information I could pull from the site, as an open-source resource for March Madness analysis.
Key features of the dataset include: - Historical Data: Provides all historical KenPom data from 2002 to 2025 from the Efficiency, Four Factors (Offense & Defense), Point Distribution, Height/Experience, and Misc. Team Stats endpoints from KenPom's website. Please note that the Height/Experience data only goes as far back as 2007, but every other source contains data from 2002 onward. - Data Granularity: This dataset features an individual line item for every NCAA Division 1 men's basketball team in every season that contains every KenPom metric that you can possibly think of. This dataset has the ability to serve as a single source of truth for your March Madness analysis and provide you with the granularity necessary to perform any type of analysis you can think of. - 2025 Tournament Insights: Contains all seed and region information for the 2025 NCAA March Madness tournament. Please note that I will continually update this dataset with the seed and region information for previous tournaments as I continue to work on this dataset.
These datasets were created by downloading the raw CSV files for each season for the various sections on KenPom's website (Efficiency, Offense, Defense, Point Distribution, Summary, Miscellaneous Team Stats, and Height). All of these raw files were uploaded to Domo and imported into a dataflow using Domo's Magic ETL. In these dataflows, all of the column headers for each of the previous seasons are standardized to the current 2025 naming structure so all of the historical data can be viewed under the exact same field names. All of these cleaned datasets are then appended together, and some additional clean up takes place before ultimately creating the intermediate (INT) datasets that are uploaded to this Kaggle dataset. Once all of the INT datasets were created, I joined all of the tables together on the team name and season so all of these different metrics can be viewed under one single view. From there, I joined an NCAAM Conference & ESPN Team Name Mapping table to add a conference field in its full length and respective acronyms they are known by as well as the team name that ESPN currently uses. Please note that this reference table is an aggregated view of all of the different conferences a team has been a part of since 2002 and the different team names that KenPom has used historically, so this mapping table is necessary to map all of the teams properly and differentiate the historical conferences from their current conferences. From there, I join a reference table that includes all of the current NCAAM coaches and their active coaching lengths because the active current coaching length typically correlates to a team's success in the March Madness tournament. I also join another reference table to include the historical post-season tournament teams in the March Madness, NIT, CBI, and CIT tournaments, and I join another reference table to differentiate the teams who were ranked in the top 12 in the AP Top 25 during week 6 of the respective NCAA season. After some additional data clean-up, all of this cleaned data exports into the "DEV _ March Madness" file that contains the consolidated view of all of this data.
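For readers outside Domo, here is a rough pandas equivalent of the appends and joins described above; every file and column name in this sketch is hypothetical, since the actual pipeline runs in Domo's Magic ETL:

```python
# A rough pandas sketch of the joins described above; all file and column
# names are hypothetical stand-ins for the Domo Magic ETL pipeline.
import pandas as pd

kenpom   = pd.read_csv("INT_kenpom_all_seasons.csv")        # appended season data
conf_map = pd.read_csv("conference_espn_name_mapping.csv")  # conference + ESPN names
coaches  = pd.read_csv("current_coaches.csv")               # active coaching lengths

merged = (kenpom
          .merge(conf_map, on=["team", "season"], how="left")  # add conference fields
          .merge(coaches, on="team", how="left"))              # add coaching tenure
merged.to_csv("DEV_march_madness.csv", index=False)
```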
This dataset provides users with the flexibility to export data for further analysis in platforms such as Domo, Power BI, Tableau, Excel, and more. This dataset is designed for users who wish to conduct their own analysis, develop predictive models, or simply gain a deeper understanding of the intricacies that result in the excitement that Division 1 men's college basketball provides every year in March. Whether you are using this dataset for academic research, personal interest, or professional interest, I hope this dataset serves as a foundational tool for exploring the vast landscape of college basketball's most riveting and anticipated event of its season.
https://creativecommons.org/publicdomain/zero/1.0/
Uplift modeling is an important yet novel area of research in machine learning which aims to explain and estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads, and uplift modeling is used to direct marketing efforts towards the users for whom it is most efficient. The data is a collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a healthy 590x factor.
The dataset was created by the Criteo AI Lab. It consists of 13M rows, each representing a user, with 12 features, a treatment indicator, and 2 binary labels (visit and conversion). A positive label means the user visited/converted on the advertiser's website during the test period (2 weeks). The global treatment ratio is 84.6%; it is usual for advertisers to keep only a small control population, as it costs them potential revenue.
Following is a detailed description of the features:
The data is provided for the paper "A Large Scale Benchmark for Uplift Modeling":
https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/large-scale-benchmark.pdf
For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.
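As one hedged baseline (the linked paper benchmarks several approaches), here is a minimal two-model uplift sketch; the column names follow the public Criteo release (f0-f11 features, treatment, visit), so adjust if your copy differs:

```python
# Minimal two-model uplift baseline: fit separate outcome models on the
# treated and control populations, then take the difference in predictions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("criteo-uplift.csv")                  # assumed file name
features = [c for c in df.columns if c.startswith("f")]  # f0..f11

treated = df[df["treatment"] == 1]
control = df[df["treatment"] == 0]
m_t = LogisticRegression(max_iter=1000).fit(treated[features], treated["visit"])
m_c = LogisticRegression(max_iter=1000).fit(control[features], control["visit"])

# Estimated individual uplift: P(visit | treated) - P(visit | control)
uplift = m_t.predict_proba(df[features])[:, 1] - m_c.predict_proba(df[features])[:, 1]
print("mean estimated uplift:", uplift.mean())
```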
We can foresee related usages such as but not limited to:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
These datasets contain information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 2012. The TED favorites dataset contains information about the videos that registered users have favorited. The TED Talks dataset contains information about all talks, including the number of views, number of comments, descriptions, speakers, and titles.
The original datasets (in JSON format) contain all the aforementioned information and, in addition, all the data related to content and replies.
The original dataset was obtained from https://www.idiap.ch/dataset/ted and was in the JSON Format. Taken verbatim from the website:
The metadata was obtained by crawling the HTML source of the list of talks and users, as well as talk and user webpages using scripts written by Nikolaos Pappas at the Idiap Research Institute, Martigny, Switzerland. The dataset is shared under the Creative Commons license (the same as the content of the TED talks) which is stored in the COPYRIGHT file. The dataset is shared for research purposes which are explained in detail in the following papers. The dataset can be used to benchmark systems that perform two tasks, namely personalized recommendations and generic recommendations. Please check the CBMI 2013 paper for a detailed description of each task.
The datasets uploaded were used by the second paper listed above.
The ones available here in the CSV format do not include the text data of comments. Instead, they just give you the number of comments on each talk.
I've always been fascinated by TED Talks and the immense diversity of content that it provides for free. I was also thoroughly inspired by a TED Talk that visually explored TED Talks stats, and I was motivated to do the same thing, albeit on a much smaller scale.
Some of the questions that can be answered with this dataset:
1. How is each TED Talk related to every other TED Talk?
2. Which are the most viewed and most favorited Talks of all time? Are they mostly the same? What does this tell us?
3. What kind of topics attract the maximum discussion and debate (in the form of comments)?
4. Which months are most popular among TED and TEDx chapters?
https://creativecommons.org/publicdomain/zero/1.0/
Imgur is an image hosting and sharing website founded in 2009. It became one of the most popular websites worldwide, with approximately 250 million users. The website does not require registration, and anyone can browse its content; however, an account must be created to post. It is famous for an event it created in 2013 in which members register to send and receive gifts from other members on the website. The event takes place around Christmas, and people share their gifts on the website, posting pictures of the process or of what they received under a specific tag. The data provided covers two sections that I think are important for understanding certain patterns within the Imgur community: the first is the Most Viral section and the second is the Secret Santa tag.
I have participated twice in the Imgur Secret Santa event and have always found funny and interesting posts in the Most Viral section. With the help of the Kaggle community, I would like to identify trends in the data provided and perhaps compare the Secret Santa data with the Most Viral data.
Two DataFrames are included, and they are almost identical in their number of columns:
The first DataFrame is Imgur Most Viral posts. It contains many of the posts labelled as viral by the Imgur community and team, using specific algorithms that track the numbers of likes and dislikes across multiple platforms. Posts may be videos, GIFs, pictures, or just text.
The second DataFrame is the Imgur Secret Santa tag. Secret Santa is an annual Imgur tradition where members can sign up to send gifts to and receive gifts from other members during the Christmas holiday. This DataFrame contains many of the posts tagged Secret Santa by the Imgur community. Posts may be videos, GIFs, pictures, or just text. There is an (is_viral) column in this DataFrame that is not present in the Most Viral DataFrame, since all of the posts there are viral.
Feature | Type | Dataset | Description |
---|---|---|---|
account_id | object | Imgur_Viral/imgur_secret_santa | Unique Account ID per member |
comment_count | float64 | Imgur_Viral/imgur_secret_santa | Number of comments made in the post |
datetime | float64 | Imgur_Viral/imgur_secret_santa | Timestamp containing date and time details |
downs | float64 | Imgur_Viral/imgur_secret_santa | Number of dislikes for the post |
favorite_count | float64 | Imgur_Viral/imgur_secret_santa | Number of users who marked the post as a favourite |
id | object | Imgur_Viral/imgur_secret_santa | Unique post ID. Different posts have different IDs, even when posted by the same member |
images_count | float64 | Imgur_Viral/imgur_secret_santa | Number of images included in the post |
points | float64 | Imgur_Viral/imgur_secret_santa | Each post will have calculated points based on (ups - downs) |
score | float64 | Imgur_Viral/imgur_secret_santa | Ticket number |
tags | object | Imgur_Viral/imgur_secret_santa | Tags are sub albums that the post will show under |
title | object | Imgur_Viral/imgur_secret_santa | Title of the post |
ups | float64 | Imgur_Viral/imgur_secret_santa | Number of likes for the post |
views | float64 | Imgur_Viral/imgur_secret_santa | Number of people that viewed the post |
is_most_viral | boolean | imgur_secret_santa | If the post is viral or not |
I would like to thank Imgur for providing an API that made collecting data from its website easier. With their help, we might be able to better understand certain trends that emerge from its community.
There is no specific problem to solve with this data; it is just a fun way to explore and learn more about programming and data analysis. I hope you enjoy playing with the data as much as I enjoyed collecting it and browsing the website.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is a combination of the following datasets:
* Icons website Municipality of Utrecht
* Most visited topics website Municipality of Utrecht

#### Icons website Municipality of Utrecht
For the website of the Municipality of Utrecht, 45 different icons have been developed. The datasets are available in PNG and PSD (Photoshop) format, and there is also a preview in which the icons are shown. Examples of available icons are:
* zoning plan;
* notification;
* marriage;
* integration.

#### Most visited topics website Municipality of Utrecht
Overview of the topics most searched for on the website of the Municipality of Utrecht (www.utrecht.nl). This information is presented per month and includes the following data per object:
* click path on the website (where the subject can be found);
* number of page views;
* average time of visit to the website;
* link to the website.
https://qdr.syr.edu/policies/qdr-standard-access-conditions
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the publisher's website.

Data Generation

The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus involves a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval, as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them, as it meant that they were also easily attainable by the general public, thus extending the documents' reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third-party, open-source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe's main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe's other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions.

The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on GitHub. ProPublica's gathering of the data directly from criminal justice officials via Freedom of Information Act requests rendered the dataset in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis.

Data Analysis

The qualitative enquiry used critical discourse analysis, which investigates the ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and with other relevant writings by the same authors.
Several more specific types of discursive strategies were of interest and attracted further critical examination:
* Testing claims and rationalizations that appear to serve the speaker's self-interest
* Examining conclusions and determining whether sufficient evidence supported them
* Revealing contradictions and/or inconsistencies within the same text and intertextually
* Assessing strategies underlying justifications and rationalizations used to promote a party's assertions and arguments
* Noticing strategic deployment of lexical phrasings, syntax, and rhetoric
* Judging sincerity of voice and the objective consideration of alternative perspectives

Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is, uncovering facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted with their significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature.

The paper could have been completed with just the critical discourse analysis. However, because one of its salient findings highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. The availability of the same dataset used by the parties in conflict made this opportunity more appealing: calculating additional algorithmic equity equations would not be troubled by irregularities arising from diverse sample sets. New variables were created as needed to calculate the algorithmic fairness equations. In addition to various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means (a sketch of this computation appears after this section).

Logic of Annotation

Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations. Critical discourse analysis offers a rich method...
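As a hedged illustration of the proportions test mentioned above, here is a minimal Python sketch using statsmodels in place of an online calculator; the counts are made-up placeholders, not values from the study:

```python
# Illustrative z-test comparison of two proportions with statsmodels; the
# counts below are placeholders, not figures from the COMPAS dataset.
from statsmodels.stats.proportion import proportions_ztest

successes = [805, 349]    # positive outcomes per group (illustrative)
n_obs     = [1918, 1459]  # group sizes (illustrative)

z_stat, p_value = proportions_ztest(successes, n_obs)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```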