This dataset was created by DNS_dataset
https://creativecommons.org/publicdomain/zero/1.0/
Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)
The categories list was going away on September 17, 2020, so I wanted to preserve it. https://support.alexa.com/hc/en-us/articles/360051913314
This dataset was generated by this Python script (v2.0): https://github.com/natanael127/dump-alexa-ranking
The sites are grouped into 17 macro categories, and the resulting tree has more than 360,000 nodes. Subjects are well organized, and each has its own ranking of most-accessed domains, so even the keys of a single sub-dictionary can serve as a useful small dataset.
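As a small illustration, here is a minimal sketch of walking that tree, assuming the dump is a nested JSON dictionary whose keys are category names (the file name alexa_categories.json is hypothetical; check the script's output for the actual name and schema):

```python
# A minimal sketch, assuming the dump is a nested JSON dictionary whose keys
# are category names; "alexa_categories.json" is a hypothetical file name.
import json

def count_nodes(node):
    """Recursively count the nodes of the nested category dictionary."""
    if not isinstance(node, dict):
        return 0
    return len(node) + sum(count_nodes(child) for child in node.values())

with open("alexa_categories.json") as f:
    tree = json.load(f)

print("macro categories:", list(tree.keys()))
print("total nodes:", count_nodes(tree))
```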
Thank you to my friend André (https://github.com/andrerclaudio) for helping me with Google Colaboratory tips and the computational power to collect the data before our deadline.
The Alexa ranking was inspired by the Library of Alexandria. In the modern world, it may be a good starting point for AI to learn about many, many subjects.
Evaluation of the most visited health websites in the world
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of Tor cell files extracted from browsing simulations using the Tor Browser. The simulations cover both desktop and mobile webpages. Data collection used the WFP-Collector tool (https://github.com/irsyadpage/WFP-Collector); all configuration necessary to reproduce the simulation is detailed in the tool repository. The webpage URLs are the first 100 websites from https://dataforseo.com/free-seo-stats/top-1000-websites. Each webpage URL is visited 90 times in each of the desktop and mobile browsing modes.
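For scale, here is a tool-agnostic sketch of the visit schedule implied above; the URLs are placeholders, and the actual crawling is performed by WFP-Collector, not by this snippet:

```python
# Tool-agnostic sketch of the visit schedule: 100 URLs x 2 modes x 90 visits.
# The URLs below are placeholders for the top-100 list referenced above.
from itertools import product

urls = [f"https://example{i}.com" for i in range(100)]  # placeholder URLs
modes = ["desktop", "mobile"]
visits = range(90)

schedule = list(product(urls, modes, visits))
print(len(schedule))  # 100 * 2 * 90 = 18,000 planned page loads
```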
In November 2024, Google.com was the leading website in Colombia by unique visits, with around 52.9 million single accesses to the URL during that month. YouTube.com came in second with approximately 30.9 million unique monthly visits. Facebook ranked third with 24.2 million unique monthly visits.
As of September 2024, 75 percent of the 100 most visited websites in the United States shared personal data with third-party advertisers even when users opted out. Moreover, 70 percent of them dropped third-party advertising cookies even when users opted out.
Web traffic statistics for the top 2000 most visited pages on nyc.gov by month.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset consists of three different data sources:
DoH enabled Firefox
DoH enabled Google Chrome
Cloudflared DoH proxy
Web-browser data was captured using the Selenium framework, which simulated normal user browsing. The browsers were instructed to visit domains taken from Alexa's top 10K most-visited websites. The capture was performed on the host by listening on the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Firefox and 1,000 pages visited by Google Chrome.
The Cloudflared DoH proxy was installed on a Raspberry Pi, and the Raspberry Pi's IP address was set as the default DNS resolver in two separate offices at our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.
The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw PCAP data, a CSV with flow data, and a CSV file with extracted features.
The CSV with extracted features has the following data fields:
The observed network traffic does not contain privacy-sensitive information.
The zip file structure is:
|-- data
|   |-- extracted-features...extracted features used in ML for DoH recognition
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   |-- flows...............................................exported flow data
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   `-- pcaps....................................................raw PCAP data
|       |-- chrome
|       |-- cloudflared
|       `-- firefox
|-- LICENSE
`-- README.md
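As a hedged starting point, here is a minimal scikit-learn sketch for training a DoH detector on the extracted-feature CSVs; the file path and the "label" column name are assumptions, so consult the bundled README.md for the actual schema:

```python
# Hypothetical sketch: training a classifier on the extracted-feature CSVs.
# The path and the "label" column name are assumptions; see README.md.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/extracted-features/firefox/features.csv")  # hypothetical path
X = df.drop(columns=["label"])   # assumed label column: 1 = DoH, 0 = regular HTTPS
y = df["label"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```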
When using this dataset, please cite the original work as follows:
@inproceedings{vekshin2020,
  author    = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
  title     = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
  year      = {2020},
  isbn      = {9781450388337},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3407023.3409192},
  doi       = {10.1145/3407023.3409192},
  booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
  articleno = {87},
  numpages  = {8},
  keywords  = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
  location  = {Virtual Event, Ireland},
  series    = {ARES '20}
}
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was generated by an e-commerce website that sells a variety of products on its online platform. The website records the behaviour of its users and stores it as a log. However, most of the time users do not buy a product instantly; there is a time gap during which the customer might surf the internet and perhaps visit competitor websites. To improve product sales, the website owner hired an adtech company that built a system to show ads for the owner's products on partner websites. If a user comes to the owner's website and searches for a product, and then visits these partner websites or apps, the items they previously viewed (or similar items) are shown to them as ads. If the user clicks such an ad, they are redirected to the owner's website and might buy the product.
The task is to predict the probability of a user clicking an ad shown to them on the partner websites over the next 7 days, on the basis of historical view-log data, ad-impression data, and user data.
You are provided with the view log of users (2018/10/15 - 2018/12/11) and the product descriptions collected from the owner's website. We also provide training and test data containing details of ad impressions at the partner websites (Train + Test). The train data contains the impression logs for 2018/11/15 - 2018/12/13, along with a label specifying whether the ad was clicked. Your model will be evaluated on the test data, which contains impression logs for 2018/12/12 - 2018/12/18 without labels. You are provided with the following files:
item_data.csv
The evaluation metric is the area under the ROC curve (ROC AUC) between the predicted probability and the observed target.
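For reference, computing that metric with scikit-learn on a toy example:

```python
# ROC AUC between observed click labels and model-predicted click probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 1, 0, 1]             # observed is_click labels (toy example)
y_pred = [0.1, 0.8, 0.65, 0.3, 0.9]  # predicted click probabilities

print(roc_auc_score(y_true, y_pred))  # 1.0: every click outranks every non-click
```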
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘NYC.gov Web Analytics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/f2b7ec11-c2ad-412c-8a63-914f40515c4d on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Web traffic statistics for the top 2000 most visited pages on nyc.gov by month.
--- Original source retains full ownership of the source dataset ---
YouTube is an American online video-sharing platform headquartered in San Bruno, California. The service, created in February 2005 by three former PayPal employees—Chad Hurley, Steve Chen, and Jawed Karim—was bought by Google in November 2006 for US$1.65 billion and now operates as one of the company's subsidiaries. YouTube is the second most-visited website after Google Search, according to Alexa Internet rankings.
YouTube allows users to upload, view, rate, share, add to playlists, report, comment on videos, and subscribe to other users. Available content includes video clips, TV show clips, music videos, short and documentary films, audio recordings, movie trailers, live streams, video blogging, short original videos, and educational videos.
YouTube (the world-famous video-sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, "To determine the year's top-trending videos, YouTube uses a combination of factors including measuring user interactions (number of views, shares, comments, and likes). Note that they're not the most-viewed videos overall for the calendar year". Top performers on the YouTube trending list are music videos (such as the famously viral "Gangnam Style"), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well known for.
This dataset is a daily record of the top trending YouTube videos.
Note that this dataset is a structurally improved version of this dataset.
This dataset was collected using the YouTube API. Parts of this description are cited from Wikipedia.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IPIS has collected data on artisanal mining sites since 2009 and made it publicly accessible through web maps and analytical reports. The upgraded map presents new mining sites, bringing the total to more than 2,400 sites, visited as recently as December 2017. New information on the mining sites has been included, and a new layer has been added displaying hundreds of roadblocks. The latest update of the map was supported by the International Organization for Migration (IOM) in the DRC, through the USAID-funded Responsible Minerals Trade (RMT) project.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of privacy policies in the Greek language, collected from the most-visited websites in Greece that provide a privacy policy in Greek.
The dataset, as well as the results of its analysis, is included.
If you want to use this dataset, please cite the relevant conference publication:
Georgia M. Kapitsaki and Maria Papoutsoglou, "A privacy policies dataset in Greek in the GDPR era," in Proceedings of the 27th Pan-Hellenic Conference on Informatics, PCI 2023.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Monthly statistics for most viewed digital records in the City Archives Digital Repository.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This Kaggle dataset comes from an output dataset that powers my March Madness Data Analysis dashboard in Domo. - Click here to view this dashboard: Dashboard Link - Click here to view this dashboard features in a Domo blog post: Hoops, Data, and Madness: Unveiling the Ultimate NCAA Dashboard
This dataset offers one of the most robust resources you will find for discovering key insights through data science and data analytics using historical NCAA Division 1 men's basketball data. The data, sourced from KenPom, goes back to 2002 and is updated with the latest 2025 data. It is meticulously structured to provide every piece of information I could pull from the site, as an open-source resource for March Madness analysis.
Key features of the dataset include: - Historical Data: Provides all historical KenPom data from 2002 to 2025 from the Efficiency, Four Factors (Offense & Defense), Point Distribution, Height/Experience, and Misc. Team Stats endpoints from KenPom's website. Please note that the Height/Experience data only goes as far back as 2007, but every other source contains data from 2002 onward. - Data Granularity: This dataset features an individual line item for every NCAA Division 1 men's basketball team in every season that contains every KenPom metric that you can possibly think of. This dataset has the ability to serve as a single source of truth for your March Madness analysis and provide you with the granularity necessary to perform any type of analysis you can think of. - 2025 Tournament Insights: Contains all seed and region information for the 2025 NCAA March Madness tournament. Please note that I will continually update this dataset with the seed and region information for previous tournaments as I continue to work on this dataset.
These datasets were created by downloading the raw CSV files for each season for the various sections on KenPom's website (Efficiency, Offense, Defense, Point Distribution, Summary, Miscellaneous Team Stats, and Height). All of these raw files were uploaded to Domo and imported into a dataflow using Domo's Magic ETL. In these dataflows, all of the column headers for each of the previous seasons are standardized to the current 2025 naming structure so all of the historical data can be viewed under the exact same field names. All of these cleaned datasets are then appended together, and some additional clean up takes place before ultimately creating the intermediate (INT) datasets that are uploaded to this Kaggle dataset. Once all of the INT datasets were created, I joined all of the tables together on the team name and season so all of these different metrics can be viewed under one single view. From there, I joined an NCAAM Conference & ESPN Team Name Mapping table to add a conference field in its full length and respective acronyms they are known by as well as the team name that ESPN currently uses. Please note that this reference table is an aggregated view of all of the different conferences a team has been a part of since 2002 and the different team names that KenPom has used historically, so this mapping table is necessary to map all of the teams properly and differentiate the historical conferences from their current conferences. From there, I join a reference table that includes all of the current NCAAM coaches and their active coaching lengths because the active current coaching length typically correlates to a team's success in the March Madness tournament. I also join another reference table to include the historical post-season tournament teams in the March Madness, NIT, CBI, and CIT tournaments, and I join another reference table to differentiate the teams who were ranked in the top 12 in the AP Top 25 during week 6 of the respective NCAA season. After some additional data clean-up, all of this cleaned data exports into the "DEV _ March Madness" file that contains the consolidated view of all of this data.
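For readers outside Domo, here is a rough pandas equivalent of the appends and joins described above; every file and column name in this sketch is hypothetical, since the actual pipeline runs in Domo's Magic ETL:

```python
# A rough pandas sketch of the joins described above; all file and column
# names are hypothetical stand-ins for the Domo Magic ETL pipeline.
import pandas as pd

kenpom   = pd.read_csv("INT_kenpom_all_seasons.csv")        # appended season data
conf_map = pd.read_csv("conference_espn_name_mapping.csv")  # conference + ESPN names
coaches  = pd.read_csv("current_coaches.csv")               # active coaching lengths

merged = (kenpom
          .merge(conf_map, on=["team", "season"], how="left")  # add conference fields
          .merge(coaches, on="team", how="left"))              # add coaching tenure
merged.to_csv("DEV_march_madness.csv", index=False)
```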
This dataset provides users with the flexibility to export data for further analysis in platforms such as Domo, Power BI, Tableau, Excel, and more. This dataset is designed for users who wish to conduct their own analysis, develop predictive models, or simply gain a deeper understanding of the intricacies that result in the excitement that Division 1 men's college basketball provides every year in March. Whether you are using this dataset for academic research, personal interest, or professional interest, I hope this dataset serves as a foundational tool for exploring the vast landscape of college basketball's most riveting and anticipated event of its season.
https://creativecommons.org/publicdomain/zero/1.0/
Uplift modeling is an important yet novel area of research in machine learning which aims to explain and estimate the causal impact of a treatment at the individual level. In the digital advertising industry, the treatment is exposure to different ads, and uplift modeling is used to direct marketing efforts towards the users for whom it is most efficient. The data is a collection of 13 million samples from a randomized control trial, scaling up previously available datasets by a healthy 590x factor.
The dataset was created by the Criteo AI Lab. It consists of 13M rows, each representing a user, with 12 features, a treatment indicator, and 2 binary labels (visit and conversion). A positive label means the user visited/converted on the advertiser's website during the test period (2 weeks). The global treatment ratio is 84.6%; it is usual for advertisers to keep only a small control population, as it costs them potential revenue.
Following is a detailed description of the features:
The data is provided for the paper "A Large Scale Benchmark for Uplift Modeling":
https://s3.us-east-2.amazonaws.com/criteo-uplift-dataset/large-scale-benchmark.pdf
For privacy reasons the data has been sub-sampled non-uniformly so that the original incrementality level cannot be deduced from the dataset while preserving a realistic, challenging benchmark. Feature names have been anonymized and their values randomly projected so as to keep predictive power while making it practically impossible to recover the original features or user context.
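As one hedged baseline (the linked paper benchmarks several approaches), here is a minimal two-model uplift sketch; the column names follow the public Criteo release (f0-f11 features, treatment, visit), so adjust if your copy differs:

```python
# Minimal two-model uplift baseline: fit separate outcome models on the
# treated and control populations, then take the difference in predictions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("criteo-uplift.csv")                  # assumed file name
features = [c for c in df.columns if c.startswith("f")]  # f0..f11

treated = df[df["treatment"] == 1]
control = df[df["treatment"] == 0]
m_t = LogisticRegression(max_iter=1000).fit(treated[features], treated["visit"])
m_c = LogisticRegression(max_iter=1000).fit(control[features], control["visit"])

# Estimated individual uplift: P(visit | treated) - P(visit | control)
uplift = m_t.predict_proba(df[features])[:, 1] - m_c.predict_proba(df[features])[:, 1]
print("mean estimated uplift:", uplift.mean())
```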
We can foresee related usages such as but not limited to:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
These datasets contain information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 2012. The TED favorites dataset contains information about the videos that registered users have favorited. The TED Talks dataset contains information about all talks, including the number of views, number of comments, descriptions, speakers, and titles.
The original datasets (in JSON format) contain all the aforementioned information and, in addition, all the data related to content and replies.
The original dataset was obtained from https://www.idiap.ch/dataset/ted and was in the JSON Format. Taken verbatim from the website:
The metadata was obtained by crawling the HTML source of the list of talks and users, as well as talk and user webpages using scripts written by Nikolaos Pappas at the Idiap Research Institute, Martigny, Switzerland. The dataset is shared under the Creative Commons license (the same as the content of the TED talks) which is stored in the COPYRIGHT file. The dataset is shared for research purposes which are explained in detail in the following papers. The dataset can be used to benchmark systems that perform two tasks, namely personalized recommendations and generic recommendations. Please check the CBMI 2013 paper for a detailed description of each task.
The datasets uploaded were used by the second paper listed above.
The ones available here in the CSV format do not include the text data of comments. Instead, they just give you the number of comments on each talk.
I've always been fascinated by TED Talks and the immense diversity of content that it provides for free. I was also thoroughly inspired by a TED Talk that visually explored TED Talks stats, and I was motivated to do the same thing, albeit on a much smaller scale.
Some of the questions that can be answered with this dataset:
1. How is each TED Talk related to every other TED Talk?
2. Which are the most viewed and most favorited Talks of all time? Are they mostly the same? What does this tell us?
3. What kind of topics attract the maximum discussion and debate (in the form of comments)?
4. Which months are most popular among TED and TEDx chapters?
https://creativecommons.org/publicdomain/zero/1.0/
Imgur is an image hosting and sharing website founded in 2009. It became one of the most popular websites worldwide, with approximately 250 million users. The website does not require registration, and anyone can browse its content; however, an account must be created to post. It is famous for an event it created in 2013 in which members register to send and receive gifts from other members on the website. The event takes place around Christmas, and people share their gifts on the website, posting pictures of the process or of what they received under a specific tag. The data provided covers two sections that I think are important for understanding certain patterns within the Imgur community: the first is the Most Viral section and the second is the Secret Santa tag.
I have participated twice in the Imgur Secret Santa event and have always found funny and interesting posts in the Most Viral section. With the help of the Kaggle community, I would like to identify trends in the data provided and perhaps compare the Secret Santa data with the Most Viral data.
Two DataFrames are included, and they are almost identical in their number of columns:
The first DataFrame is Imgur Most Viral posts. It contains many of the posts labelled as viral by the Imgur community and team, using specific algorithms that track the numbers of likes and dislikes across multiple platforms. Posts may be videos, GIFs, pictures, or just text.
The second DataFrame is the Imgur Secret Santa tag. Secret Santa is an annual Imgur tradition where members can sign up to send gifts to and receive gifts from other members during the Christmas holiday. This DataFrame contains many of the posts tagged Secret Santa by the Imgur community. Posts may be videos, GIFs, pictures, or just text. There is an (is_viral) column in this DataFrame that is not present in the Most Viral DataFrame, since all of the posts there are viral.
Feature | Type | Dataset | Description |
---|---|---|---|
account_id | object | Imgur_Viral/imgur_secret_santa | Unique Account ID per member |
comment_count | float64 | Imgur_Viral/imgur_secret_santa | Number of comments made in the post |
datetime | float64 | Imgur_Viral/imgur_secret_santa | Timestamp containing date and time details |
downs | float64 | Imgur_Viral/imgur_secret_santa | Number of dislikes for the post |
favorite_count | float64 | Imgur_Viral/imgur_secret_santa | Number of users who marked the post as a favourite |
id | object | Imgur_Viral/imgur_secret_santa | Unique post ID. Different posts have different IDs, even when posted by the same member |
images_count | float64 | Imgur_Viral/imgur_secret_santa | Number of images included in the post |
points | float64 | Imgur_Viral/imgur_secret_santa | Each post will have calculated points based on (ups - downs) |
score | float64 | Imgur_Viral/imgur_secret_santa | Ticket number |
tags | object | Imgur_Viral/imgur_secret_santa | Tags are sub albums that the post will show under |
title | object | Imgur_Viral/imgur_secret_santa | Title of the post |
ups | float64 | Imgur_Viral/imgur_secret_santa | Number of likes for the post |
views | float64 | Imgur_Viral/imgur_secret_santa | Number of people that viewed the post |
is_most_viral | boolean | imgur_secret_santa | If the post is viral or not |
I would like to thank Imgur for providing an API that made collecting data from its website easier. With their help, we might be able to better understand certain trends that emerge from its community.
There is no specific problem to solve with this data; it is just a fun way to explore and learn more about programming and data analysis. I hope you enjoy playing with the data as much as I enjoyed collecting it and browsing the website.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is a combination of the following datasets:
* Icons website Municipality of Utrecht
* Most visited topics website Municipality of Utrecht

#### Icons website Municipality of Utrecht
For the website of the Municipality of Utrecht, 45 different icons have been developed. The datasets are available in PNG and PSD (Photoshop) format, and there is also a preview in which the icons are shown. Examples of available icons are:
* zoning plan;
* notification;
* marriage;
* integration.

#### Most visited topics website Municipality of Utrecht
Overview of the topics most searched for on the website of the Municipality of Utrecht (www.utrecht.nl). This information is presented per month and includes the following data per object:
* click path on the website (where the subject can be found);
* number of page views;
* average time of visit to the website;
* link to the website.
https://qdr.syr.edu/policies/qdr-standard-access-conditions
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the publisher's website.

Data Generation

The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus involves a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval, as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them, as it meant that they were also easily attainable by the general public, thus extending the documents' reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third-party, open-source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe's main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe's other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions.

The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on GitHub. ProPublica's gathering of the data directly from criminal justice officials via Freedom of Information Act requests rendered the dataset in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis.

Data Analysis

The qualitative enquiry used critical discourse analysis, which investigates the ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and with other relevant writings by the same authors.
Several more specific types of discursive strategies were of interest and attracted further critical examination:
* Testing claims and rationalizations that appear to serve the speaker's self-interest
* Examining conclusions and determining whether sufficient evidence supported them
* Revealing contradictions and/or inconsistencies within the same text and intertextually
* Assessing strategies underlying justifications and rationalizations used to promote a party's assertions and arguments
* Noticing strategic deployment of lexical phrasings, syntax, and rhetoric
* Judging sincerity of voice and the objective consideration of alternative perspectives

Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is, uncovering facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted with their significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature.

The paper could have been completed with just the critical discourse analysis. However, because one of its salient findings highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. The availability of the same dataset used by the parties in conflict made this opportunity more appealing: calculating additional algorithmic equity equations would not be troubled by irregularities arising from diverse sample sets. New variables were created as needed to calculate the algorithmic fairness equations. In addition to various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means (a sketch of this computation appears after this section).

Logic of Annotation

Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations. Critical discourse analysis offers a rich method...
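As a hedged illustration of the proportions test mentioned above, here is a minimal Python sketch using statsmodels in place of an online calculator; the counts are made-up placeholders, not values from the study:

```python
# Illustrative z-test comparison of two proportions with statsmodels; the
# counts below are placeholders, not figures from the COMPAS dataset.
from statsmodels.stats.proportion import proportions_ztest

successes = [805, 349]    # positive outcomes per group (illustrative)
n_obs     = [1918, 1459]  # group sizes (illustrative)

z_stat, p_value = proportions_ztest(successes, n_obs)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```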