95 datasets found

e
Most popular websites in the Netherlands 2015 - Dataset - B2FIND
b2find.eudat.eu
Updated Jun 2, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2017). Most popular websites in the Netherlands 2015 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3edeb59b-b49b-59cb-9757-9127aed7e8af
Explore at:
Dataset updated
Jun 2, 2017
Area covered
Netherlands
Description
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as whitelist for the Newstracker Research project in which we monitored the online web behaviour of a group of respondents.The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (please find the code in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist with websites that were the most popular websites in 2015. We manually compiled this list by using data of DDMM, Alexa and own research. The dataset consists of 5 columns:- the URL- the type of website: We created a list of types of websites and each website has been manually labeled with 1 category- Nieuws-regio: When the category was 'News', we subdivided these websites in the regional focus: International, National or Local- Nieuws-onderwerp: Furthermore, each website under the category News was further subdivided in type of news website. For this we created an own list of news categories and manually coded each website- Bron: For each website we noted which source we used to find this website.The full description of the research design of the Newstracker including the set-up of this whitelist is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
Online Sales Dataset - Popular Marketplace Data
kaggle.com
Updated May 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ShreyanshVerma27 (2024). Online Sales Dataset - Popular Marketplace Data [Dataset]. https://www.kaggle.com/datasets/shreyanshverma27/online-sales-dataset-popular-marketplace-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 25, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
ShreyanshVerma27
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.

Columns:

Order ID: Unique identifier for each sales order.

Date:Date of the sales transaction.

Category:Broad category of the product sold (e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).

Product Name:Specific name or model of the product sold.

Quantity:Number of units of the product sold in the transaction.

Unit Price:Price of one unit of the product.

Total Price: Total revenue generated from the sales transaction (Quantity * Unit Price).

Region:Geographic region where the transaction occurred (e.g., North America, Europe, Asia).

Payment Method: Method used for payment (e.g., Credit Card, PayPal, Debit Card).

Insights:

1. Analyze sales trends over time to identify seasonal patterns or growth opportunities.

2. Explore the popularity of different product categories across regions.

3. Investigate the impact of payment methods on sales volume or revenue.

4. Identify top-selling products within each category to optimize inventory and marketing strategies.

5. Evaluate the performance of specific products or categories in different regions to tailor marketing campaigns accordingly.
n
(Dataset) The most visited health websites in the world
narcis.nl
data.mendeley.com
Updated Jan 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Acosta-Vargas, P (via Mendeley Data) (2021). (Dataset) The most visited health websites in the world [Dataset]. http://doi.org/10.17632/n468trh5my.1
Explore at:
Unique identifier
https://doi.org/10.17632/n468trh5my.1
Dataset updated
Jan 11, 2021
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Acosta-Vargas, P (via Mendeley Data)
Description
Evaluation of the most visited health websites in the world
Most visited websites by hierachycal categories
kaggle.com
Updated Sep 18, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The citation is currently not available for this dataset.
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 18, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Natanael de Souza Figueiredo
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)

The categories list was going out by September, 17h, 2020. So I would like to save it. https://support.alexa.com/hc/en-us/articles/360051913314

This dataset was elaborated by this python script (V2.0): https://github.com/natanael127/dump-alexa-ranking

Content

The sites are grouped in 17 macro categories and this tree ends having more than 360.000 nodes. Subjects are very organized and each of them has its own rank of most accessed domains. So, even the keys of a sub-dictionary may be a good small dataset to use.

Acknowledgements

Thank you my friend André (https://github.com/andrerclaudio) by helping me with tips of Google Colaboratory and computational power to get the data until our deadline.

Inspiration

Alexa ranking was inspired by Library of Alexandria. In the modern world, it may be a good start for AI know more about many, many subjects of the world.
Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Jun 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Jun 19, 2024
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Z
Dataset used for HTTPS traffic classification using packet burst statistics
data.niaid.nih.gov
Updated Apr 11, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cejka Tomas (2022). Dataset used for HTTPS traffic classification using packet burst statistics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4911550
Explore at:
Dataset updated
Apr 11, 2022
Dataset provided by
Hynek Karel
Tropkova Zdena
Cejka Tomas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We are publishing a dataset we created for the HTTPS traffic classification.

Since the data were captured mainly in the real backbone network, we omitted IP addresses and ports. The datasets consist of calculated from bidirectional flows exported with flow probe Ipifixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and time. For more information, please visit ipfixprobe repository (Ipifixprobe).

During our research, we divided HTTPS into five categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, W -- Website, and other traffic.

We have chosen the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:

Live Video Stream Twitch, Czech TV, YouTube Live

Video Player DailyMotion, Stream.cz, Vimeo, YouTube

Music Player AppleMusic, Spotify, SoundCloud

File Upload/Download FileSender, OwnCloud, OneDrive, Google Drive

Website and Other Traffic Websites from Alexa Top 1M list
Google Maps Dataset
brightdata.com
.json, .csv, .xlsx
Updated Jan 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2023). Google Maps Dataset [Dataset]. https://brightdata.com/products/datasets/google-maps
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Jan 8, 2023
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
The Google Maps dataset is ideal for getting extensive information on businesses anywhere in the world. Easily filter by location, business type, and other factors to get the exact data you need. The Google Maps dataset includes all major data points: timestamp, name, category, address, description, open website, phone number, open_hours, open_hours_updated, reviews_count, rating, main_image, reviews, url, lat, lon, place_id, country, and more.
G2 Dataset
brightdata.com
.json, .csv, .xlsx
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data, G2 Dataset [Dataset]. https://brightdata.com/products/datasets/g2
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Use our G2 dataset to collect product descriptions, ratings, reviews, and pricing information from the world's largest tech marketplace. You may purchase a full or partial dataset depending on your business needs. The G2 Software Products Dataset, with a focus on top-rated products, serves as a valuable resource for software buyers, businesses, and technology enthusiasts. This use case highlights products that have received exceptional ratings and positive reviews on the G2 platform, offering insights into customer satisfaction and popularity. For software buyers, this dataset acts as a trusted guide, presenting a curated selection of G2's top-rated software products, ensuring a higher likelihood of satisfaction with purchases. Businesses and technology professionals can leverage this dataset to identify popular and well-reviewed software solutions, optimizing their decision-making process. This use case emphasizes the dataset's utility for those specifically interested in exploring and acquiring top-rated software products from G2's Product Overview The G2 software products and reviews dataset offer a detailed and thorough overview of leading software companies. The dataset includes all major data points: Product descriptions Average rating (1-5) Sellers number of reviews Key features (highest and lowest rated) Competitors Website & social media links and more.
Traces captured by visiting the top 1500 website
kaggle.com
zip
Updated Aug 25, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DNS_dataset (2021). Traces captured by visiting the top 1500 website [Dataset]. https://www.kaggle.com/jacksontang16/traces-captured-by-visiting-the-top-1500-website
Explore at:
zip(5852806 bytes)Available download formats
Dataset updated
Aug 25, 2021
Authors
DNS_dataset
Description
Dataset

This dataset was created by DNS_dataset

Contents
LinkedIn Datasets
brightdata.com
.json, .csv, .xlsx
Updated Dec 17, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2021). LinkedIn Datasets [Dataset]. https://brightdata.com/products/datasets/linkedin
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 17, 2021
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions. Dataset Features

Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month. Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records. Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.

Customizable Subsets for Specific Needs Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications. Popular Use Cases

Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data. Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities. Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies. Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis. AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.

Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.
m
Reddit r/AskScience Flair Dataset
data.mendeley.com
Updated May 23, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
Explore at:
Unique identifier
https://doi.org/10.17632/k9r2d9z999.3
Dataset updated
May 23, 2022
Authors
Sumit Mishra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Reddit is a social news, content rating and discussion website. It's one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits and here We'll use the r/AskScience Subreddit.

The dataset is extracted from the subreddit /r/AskScience from Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 Datapoints and 25 Columns. The database contains a number of information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data is extracted using python and Pushshift's API. A little bit of cleaning is done using NumPy and pandas as well. (see the descriptions of individual columns below).

The dataset contains the following columns and descriptions: author - Redditor Name author_fullname - Redditor Full name contest_mode - Contest mode [implement obscured scores and randomized sorting]. created_utc - Time the submission was created, represented in Unix Time. domain - Domain of submission. edited - If the post is edited or not. full_link - Link of the post on the subreddit. id - ID of the submission. is_self - Whether or not the submission is a self post (text-only). link_flair_css_class - CSS Class used to identify the flair. link_flair_text - Flair on the post or The link flair’s text content. locked - Whether or not the submission has been locked. num_comments - The number of comments on the submission. over_18 - Whether or not the submission has been marked as NSFW. permalink - A permalink for the submission. retrieved_on - time ingested. score - The number of upvotes for the submission. description - Description of the Submission. spoiler - Whether or not the submission has been marked as a spoiler. stickied - Whether or not the submission is stickied. thumbnail - Thumbnail of Submission. question - Question Asked in the Submission. url - The URL the submission links to, or the permalink if a self post. year - Year of the Submission. banned - Banned by the moderator or not.

This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
Z
Dataset: A Systematic Literature Review on the topic of High-value datasets
data.niaid.nih.gov
Updated Jun 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Magdalena Ciesielska (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
Explore at:
Dataset updated
Jun 23, 2023
Dataset provided by
Magdalena Ciesielska
Nina Rizun
Andrea Miletič
Anastasija Nikiforova
Charalampos Alexopoulos
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun, Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean) and Andrea Miletič (University of Zagreb) It being made public both to act as supplementary data for "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (pre-print is available in Open Access here -> https://arxiv.org/abs/2305.10234) and in order for other researchers to use these data in their own work.

The protocol is intended for the Systematic Literature review on the topic of High-value Datasets with the aim to gather information on how the topic of High-value datasets (HVD) and their determination has been reflected in the literature over the years and what has been found by these studies to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected in the result of the SLR over Scopus, Web of Science, and Digital Government Research library (DGRL) in 2023.

Methodology

To understand how HVD determination has been reflected in the literature over the years and what has been found by these studies to date, all relevant literature covering this topic has been studied. To this end, the SLR was carried out to by searching digital libraries covered by Scopus, Web of Science (WoS), Digital Government Research library (DGRL).

These databases were queried for keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the number of papers to those, where these objects were primary research objects rather than mentioned in the body, e.g., as a future work. After deduplication, 11 articles were found unique and were further checked for relevance. As a result, a total of 9 articles were further examined. Each study was independently examined by at least two authors.

To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

Test procedure Each study was independently examined by at least two authors, where after the in-depth examination of the full-text of the article, the structured protocol has been filled for each study. The structure of the survey is available in the supplementary file available (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx) The data collected for each study by two researchers were then synthesized in one final version by the third researcher.

Description of the data in this data set

Protocol_HVD_SLR provides the structure of the protocol Spreadsheets #1 provides the filled protocol for relevant studies. Spreadsheet#2 provides the list of results after the search over three indexing databases, i.e. before filtering out irrelevant studies

The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

Descriptive information
1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet 2) Complete reference - the complete source information to refer to the study 3) Year of publication - the year in which the study was published 4) Journal article / conference paper / book chapter - the type of the paper -{journal article, conference paper, book chapter} 5) DOI / Website- a link to the website where the study can be found 6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science 7) Availability in OA - availability of an article in the Open Access 8) Keywords - keywords of the paper as indicated by the authors 9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

Approach- and research design-related information 10) Objective / RQ - the research objective / aim, established research questions 11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analy-sis (country, organisation, specific unit that has been ana-lysed, e.g., the number of use-cases, scope of the SLR etc.) 12) Contributions - the contributions of the study 13) Method - whether the study uses a qualitative, quantitative, or mixed methods approach? 14) Availability of the underlying research data- whether there is a reference to the publicly available underly-ing research data e.g., transcriptions of interviews, collected data, or explanation why these data are not shared? 15) Period under investigation - period (or moment) in which the study was conducted 16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is theory used in the study?

Quality- and relevance- related information
17) Quality concerns - whether there are any quality concerns (e.g., limited infor-mation about the research methods used)? 18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused around the HVD determination, sec-ondary - mentioned but not studied (e.g., as part of discus-sion, future work etc.))

HVD determination-related information
19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term? 20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, “input -> output") 21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the rela-tionships between these components? (detailed description) 22) Stakeholders and their roles - what stakeholders or actors does HVD determination in-volve? What are their roles? 23) Data - what data do HVD cover? 24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

Format of the file .xls, .csv (for the first spreadsheet only), .odt, .docx

Licenses or restrictions CC-BY

For more info, see README.txt
w
The Pirate Bay 2008-12 Dataset - CSG
data.wu.ac.at
csv
Updated Oct 10, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Global (2013). The Pirate Bay 2008-12 Dataset - CSG [Dataset]. https://data.wu.ac.at/odso/datahub_io/YzZlZDgwM2QtMWMyYS00NmU4LThmNGItZDEzMGYxMGY2MmY4
Explore at:
csvAvailable download formats
Dataset updated
Oct 10, 2013
Dataset provided by
Global
Description
[From website] This dataset was obtained from The Pirate Bay - one of the most popular BitTorrent trackers. The website was crawled and information regarding every published torrent was obtained. The information is useful to elaborate more realistic simulations involving traffic management of peer-to-peer file sharing applications.

This dataset contains the metadata of 679515 unique torrents from The Pirate Bay, collected on December 5th, 2008. The metadata includes for each torrent: ID of the torrent, category, torrent size, number of leechers, number of seeders.

Download: full dataset in CSV format (compressed with tar/bzip2, 5.0MB)

Please cite this dataset as

The Pirate Bay 2008-12 Dataset, Fabio Hecht, Thomas Bocek, David Hausheer, http://www.csg.uzh.ch/publications/data/piratebay/

Openness: OPEN?

License: none specified but implication of availability and statement re. citation implies free usage subject to attribution
Dataset used for detecting DNS over HTTPS by Machine Learning.
zenodo.org
data.niaid.nih.gov
zip
Updated Oct 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dmitrii Vekshin; Karel Hynek; Karel Hynek; Tomas Cejka; Tomas Cejka; Dmitrii Vekshin (2020). Dataset used for detecting DNS over HTTPS by Machine Learning. [Dataset]. http://doi.org/10.5281/zenodo.3906526
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3906526
Dataset updated
Oct 28, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dmitrii Vekshin; Karel Hynek; Karel Hynek; Tomas Cejka; Tomas Cejka; Dmitrii Vekshin
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset consists of three different data sources:

DoH enabled Firefox

DoH enabled Google Chrome

Cloudflared DoH proxy

The capture of web browser data was made using the Selenium framework, which simulated classical user browsing. The browsers received command for visiting domains taken from Alexa's top 10K most visited websites. The capturing was performed on the host by listening to the network interface of the virtual machine. Overall the dataset contains almost 5,000 web-page visits by Mozilla and 1,000 pages visited by Chrome.

The Cloudflared DoH proxy was installed in Raspberry PI, and the IP address of the Raspberry was set as the default DNS resolver in two separate offices in our university. It was continuously capturing the DNS/DoH traffic created up to 20 devices for around three months.

The dataset contains 1,128,904 flows from which is around 33,000 labeled as DoH. We provide raw pcap data, CSV with flow data, and CSV file with extracted features.

The CSV with extracted features has the following data fields:

- Label (1 - Doh, 0 - regular HTTPS)
- Data source
- Duration
- Minimal Inter-Packet Delay
- Maximal Inter-Packet Delay
- Average Inter-Packet Delay
- A variance of Incoming Packet Sizes
- A variance of Outgoing Packet Sizes
- A ratio of the number of Incoming and outgoing bytes
- A ration of the number of Incoming and outgoing packets
- Average of Incoming Packet sizes
- Average of Outgoing Packet sizes
- The median value of Incoming Packet sizes
- The median value of outgoing Packet sizes
- The ratio of bursts and pauses
- Number of bursts
- Number of pauses
- Autocorrelation
- Transmission symmetry in the 1st third of connection
- Transmission symmetry in the 2nd third of connection
- Transmission symmetry in the last third of connection

The observed network traffic does not contain privacy-sensitive information.

The zip file structure is:

|-- data | |-- extracted-features...extracted features used in ML for DoH recognition | | |-- chrome | | |-- cloudflared | | `-- firefox | |-- flows...............................................exported flow data | | |-- chrome | | |-- cloudflared | | `-- firefox | `-- pcaps....................................................raw PCAP data | |-- chrome | |-- cloudflared | `-- firefox |-- LICENSE `-- README.md

When using this dataset, please cite the original work as follows:

@inproceedings{vekshin2020, author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas}, title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning}, year = {2020}, isbn = {9781450388337}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3407023.3409192}, doi = {10.1145/3407023.3409192}, booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security}, articleno = {87}, numpages = {8}, keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets}, location = {Virtual Event, Ireland}, series = {ARES '20} }
Z
PULSE dataset
data.niaid.nih.gov
data.europa.eu
Updated Feb 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexandre Esse (2021). PULSE dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3928561
Explore at:
Dataset updated
Feb 11, 2021
Dataset provided by
Vladimir Urosevic
Alexandre Esse
Description
Motivation

This dataset is derived and cleaned from the full PULSE project dataset to share with others data gathered about the users during the project.

Disclaimer

Any third party need to respect ethics rules and GDPR and must mention “PULSE DATA H2020 - 727816” in any dissemination activities related to data being exploited. Also, you should provide a link to the project associated website: http://www.project-pulse.eu/

The data provided in the files is provided as is. Despite our best efforts at filtering out potential issues, some information could be erroneous.

Description of the dataset

The only difference with the original dataset comes from anonymised user information.

The dataset content is described in a dedicated JSON file:

{ "citizen_id": "pseudonymized unique key of each citizen user in the PULSE system", "city_code": { "description": "3-letter city codes taken by convention from IATA codebook of airports and metropolitan areas, as the codebook of global cities in most common and widespread use and therefore adopted as standard in PULSE (since there is currently - in the year 2020 - still no relevant ISO or other standardized codebook of cities uniformly globally adopted and used). Exception is Pavia which does not have its own airport,and nearby Milan/Bergamo airports are not applicable, so the 'PAI' internal code (not existing in original IATA codes) has been devised in PULSE. For cities with multiple airports, IATA metropolitan area codes are used (New York, Paris).", "BCN": "Barcelona", "BHX": "Birmingham", "NYC": "New York", "PAI": "Pavia", "PAR": "Paris", "SIN": "Singapore", "TPE": "Keelung(Taipei)" }, "zip_code": "Zip or postal code (area) within a city, basic default granular territorial/administrative subdivision unit for localization of citizen users by place of residence (in all PULSE cities)", "models": { "asthma_risk_score": "PULSE asthma risk consensus model score, decimal value ranging from 0 to 1", "asthma_risk_score_category": { "description": "Categorized value of the PULSE asthma risk consensus model score, with the following possible category options:", "low": "low asthma risk, score value below 0,05", "medium-low": "medium-low asthma risk, score value from 0,05 and below 0,1", "medium": "medium asthma risk, score value from 0,1 and below 0,15", "medium-high": "medium-high asthma risk, score value from 0,15 and below 0,2", "high": "high asthma risk, score value from 0,2 and higher" }, "T2D_risk_score": "PULSE diabetes type 2 (T2D) risk consensus model score, decimal value ranging from 0 to 1", "T2D_risk_score_category": { "description": "Categorized value of the PULSE diabetes type 2 risk consensus model score, with the following possible category options:", "low": "low T2D risk, score value below 0,05", "medium-low": "medium-low T2D risk, score value from 0,05 and below 0,1", "medium": "medium T2D risk, score value from 0,1 and below 0,15", "medium-high": "medium-high T2D risk, score value from 0,15 and below 0,2", "high": "high T2D risk, score value from 0,2 and below 0,25", "very_high": "very high T2D risk, score value from 0,25 and higher" }, "well-being_score": "PULSE well-being model score, decimal value ranging from -5 to 5", "well-being_score_category": { "description": "Categorized value of the PULSE well-being model score, with the following possible category options:", "low": "low well-being, score value below -0,37", "medium-low": "medium-low well-being, score value from -0,37 and below 0,04", "medium-high": "medium-high well-being, score value from 0,04 and below 0,36", "high": "high well-being, score value from 0,36 and higher" }, "computed_time": "Timestamp (UTC) when each relevant model score value/result had been computed or derived"
} }
A
Most Viewed Digital Records in City Archives Digital Repository
data.boston.gov
csv
Updated Apr 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archives and Record Management (2019). Most Viewed Digital Records in City Archives Digital Repository [Dataset]. https://data.boston.gov/dataset/most-viewed-digital-records-in-city-archives-digital-repository
Explore at:
csv(7727), csv(7735), csv, csv(7393), csv(7760), csv(7531), csv(7791), csv(7671), csv(7559)Available download formats
Dataset updated
Apr 12, 2019
Dataset authored and provided by
Archives and Record Management
License
ODC Public Domain Dedication and Licence (PDDL) v1.0http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Description
Monthly statistics for most viewed digital records in the City Archives Digital Repository.
g
Alexa, International Top 100 Websites, Global, 10.12.2007
geocommons.com
Updated Apr 29, 2008
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexa (2008). Alexa, International Top 100 Websites, Global, 10.12.2007 [Dataset]. http://geocommons.com/search.html
Explore at:
Dataset updated
Apr 29, 2008
Dataset provided by
data
Alexa
Description
This Dataset shows the Alexa Top 100 International Websites, and provides metrics on the volume of traffic that these sites were able to handle. The Alexa top 100 lists the 100 most visited websites in the world and measures various statistical information. I have looked up the Headquarters, either through alexa, or a Whois Lookup to get street address with i was then able to geocode. I was only able to successfully geocode 85 of the top 100 sites throughout the world. Source of Data was Alexa.com, Source URL: http://www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none Data was from October 12, 2007. Alexa is updated daily so to get more up to date information visit their site directly. they don't have maps though.
C
National Hydrography Data - NHD and 3DHP
data.cnra.ca.gov
data.ca.gov
+1more
Updated Jul 16, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California Department of Water Resources (2025). National Hydrography Data - NHD and 3DHP [Dataset]. https://data.cnra.ca.gov/dataset/national-hydrography-dataset-nhd
Explore at:
website, zip(39288832), pdf(3684753), pdf, pdf(1175775), zip(13901824), pdf(182651), arcgis geoservices rest api, zip(972664), pdf(9867020), csv(12977), zip(4657694), zip(578260992), pdf(1436424), zip(1647291), zip(15824984), pdf(1634485), zip(73817620), zip(10029073), pdf(4856863), zip(128966494), pdf(437025), web videos, pdf(3932070)Available download formats
Dataset updated
Jul 16, 2025
Dataset authored and provided by
California Department of Water Resources
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
The USGS National Hydrography Dataset (NHD) downloadable data collection from The National Map (TNM) is a comprehensive set of digital spatial data that encodes information about naturally occurring and constructed bodies of surface water (lakes, ponds, and reservoirs), paths through which water flows (canals, ditches, streams, and rivers), and related entities such as point features (springs, wells, stream gages, and dams). The information encoded about these features includes classification and other characteristics, delineation, geographic name, position and related measures, a "reach code" through which other information can be related to the NHD, and the direction of water flow. The network of reach codes delineating water and transported material flow allows users to trace movement in upstream and downstream directions. In addition to this geographic information, the dataset contains metadata that supports the exchange of future updates and improvements to the data. The NHD supports many applications, such as making maps, geocoding observations, flow modeling, data maintenance, and stewardship. For additional information on NHD, go to https://www.usgs.gov/core-science-systems/ngp/national-hydrography.

DWR was the steward for NHD and Watershed Boundary Dataset (WBD) in California. We worked with other organizations to edit and improve NHD and WBD, using the business rules for California. California's NHD improvements were sent to USGS for incorporation into the national database. The most up-to-date products are accessible from the USGS website. Please note that the California portion of the National Hydrography Dataset is appropriate for use at the 1:24,000 scale.

For additional derivative products and resources, including the major features in geopackage format, please go to this page: https://data.cnra.ca.gov/dataset/nhd-major-features Archives of previous statewide extracts of the NHD going back to 2018 may be found at https://data.cnra.ca.gov/dataset/nhd-archive.

In September 2022, USGS officially notified DWR that the NHD would become static as USGS resources will be devoted to the transition to the new 3D Hydrography Program (3DHP). 3DHP will consist of LiDAR-derived hydrography at a higher resolution than NHD. Upon completion, 3DHP data will be easier to maintain, based on a modern data model and architecture, and better meet the requirements of users that were documented in the Hydrography Requirements and Benefits Study (2016). The initial releases of 3DHP include NHD data cross-walked into the 3DHP data model. It will take several years for the 3DHP to be built out for California. Please refer to the resources on this page for more information.

The FINAL,STATIC version of the National Hydrography Dataset for California was published for download by USGS on December 27, 2023. This dataset can no longer be edited by the state stewards. The next generation of national hydrography data is the USGS 3D Hydrography Program (3DHP).

Questions about the California stewardship of these datasets may be directed to nhd_stewardship@water.ca.gov.
Success.ai | B2B Company & Contact Data – 28M Verified Company Profiles -...
datarade.ai
Updated Oct 15, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Success.ai (2024). Success.ai | B2B Company & Contact Data – 28M Verified Company Profiles - Global - Best Price Guarantee & 99% Data Accuracy [Dataset]. https://datarade.ai/data-products/success-ai-b2b-company-contact-data-28m-verified-compan-success-ai
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Oct 15, 2024
Dataset provided by
Area covered
Solomon Islands, United Republic of, Côte d'Ivoire, Niger, Greenland, Burundi, Somalia, Poland, Hungary, India
Description
Success.ai’s Company Data Solutions provide businesses with powerful, enterprise-ready B2B company datasets, enabling you to unlock insights on over 28 million verified company profiles. Our solution is ideal for organizations seeking accurate and detailed B2B contact data, whether you’re targeting large enterprises, mid-sized businesses, or small business contact data.

Success.ai offers B2B marketing data across industries and geographies, tailored to fit your specific business needs. With our white-glove service, you’ll receive curated, ready-to-use company datasets without the hassle of managing data platforms yourself. Whether you’re looking for UK B2B data or global datasets, Success.ai ensures a seamless experience with the most accurate and up-to-date information in the market.

Why Choose Success.ai’s Company Data Solution? At Success.ai, we prioritize quality and relevancy. Every company profile is AI-validated for a 99% accuracy rate and manually reviewed to ensure you're accessing actionable and GDPR-compliant data. Our price match guarantee ensures you receive the best deal on the market, while our white-glove service provides personalized assistance in sourcing and delivering the data you need.

Why Choose Success.ai?

Best Price Guarantee: We offer industry-leading pricing and beat any competitor.

Global Reach: Access over 28 million verified company profiles across 195 countries.

Comprehensive Data: Over 15 data points, including company size, industry, funding, and technologies used.

Accurate & Verified: AI-validated with a 99% accuracy rate, ensuring high-quality data.

Real-Time Updates: Stay ahead with continuously updated company information.

Ethically Sourced Data: Our B2B data is compliant with global privacy laws, ensuring responsible use.

Dedicated Service: Receive personalized, curated data without the hassle of managing platforms.

Tailored Solutions: Custom datasets are built to fit your unique business needs and industries.

Our database spans 195 countries and covers 28 million public and private company profiles, with detailed insights into each company’s structure, size, funding history, and key technologies. We provide B2B company data for businesses of all sizes, from small business contact data to large corporations, with extensive coverage in regions such as North America, Europe, Asia-Pacific, and Latin America.

Comprehensive Data Points: Success.ai delivers in-depth information on each company, with over 15 data points, including:

Company Name: Get the full legal name of the company. LinkedIn URL: Direct link to the company's LinkedIn profile. Company Domain: Website URL for more detailed research. Company Description: Overview of the company’s services and products. Company Location: Geographic location down to the city, state, and country. Company Industry: The sector or industry the company operates in. Employee Count: Number of employees to help identify company size. Technologies Used: Insights into key technologies employed by the company, valuable for tech-based outreach. Funding Information: Track total funding and the most recent funding dates for investment opportunities. Maximize Your Sales Potential: With Success.ai’s B2B contact data and company datasets, sales teams can build tailored lists of target accounts, identify decision-makers, and access real-time company intelligence. Our curated datasets ensure you’re always focused on high-value leads—those who are most likely to convert into clients. Whether you’re conducting account-based marketing (ABM), expanding your sales pipeline, or looking to improve your lead generation strategies, Success.ai offers the resources you need to scale your business efficiently.

Tailored for Your Industry: Success.ai serves multiple industries, including technology, healthcare, finance, manufacturing, and more. Our B2B marketing data solutions are particularly valuable for businesses looking to reach professionals in key sectors. You’ll also have access to small business contact data, perfect for reaching new markets or uncovering high-growth startups.

From UK B2B data to contacts across Europe and Asia, our datasets provide global coverage to expand your business reach and identify new markets. With continuous data updates, Success.ai ensures you’re always working with the freshest information.

Key Use Cases:

Targeted Lead Generation: Build accurate lead lists by filtering data by company size, industry, or location. Target decision-makers in key industries to streamline your B2B sales outreach.

Account-Based Marketing (ABM): Use B2B company data to personalize marketing campaigns, focusing on high-value accounts and improving conversion rates.

Investment Research: Track company growth, funding rounds, and employee trends to identify investment opportunities or potential M&A targets.

Market Research: Enrich your market intelligence initiatives by gain...
Network Traffic Dataset
kaggle.com
Updated Oct 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ravikumar Gattu (2023). Network Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/ravikumargattu/network-traffic-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 31, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ravikumar Gattu
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The data presented here was obtained in a Kali Machine from University of Cincinnati,Cincinnati,OHIO by carrying out packet captures for 1 hour during the evening on Oct 9th,2023 using Wireshark.This dataset consists of 394137 instances were obtained and stored in a CSV (Comma Separated Values) file.This large dataset could be used utilised for different machine learning applications for instance classification of Network traffic,Network performance monitoring,Network Security Management , Network Traffic Management ,network intrusion detection and anomaly detection.

The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.

Content :

This network traffic dataset consists of 7 features.Each instance contains the information of source and destination IP addresses, The majority of the properties are numeric in nature, however there are also nominal and date kinds due to the Timestamp.

The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).

Dataset Columns:

No : Number of Instance. Timestamp : Timestamp of instance of network traffic Source IP: IP address of Source Destination IP: IP address of Destination Portocol: Protocol used by the instance Length: Length of Instance Info: Information of Traffic Instance

Acknowledgements :

I would like thank University of Cincinnati for giving the infrastructure for generation of network traffic data set.

Ravikumar Gattu , Susmitha Choppadandi

Inspiration : This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP,ARP,RARP) that an IP flow contains. Instead, it generates machine learning models that can identify specific applications (like Tiktok,Wikipedia,Instagram,Youtube,Websites,Blogs etc.) from IP flow statistics (there are currently 25 applications in total).

**Dataset License: ** CC0: Public Domain

Dataset Usages : This dataset can be used for different machine learning applications in the field of cybersecurity such as classification of Network traffic,Network performance monitoring,Network Security Management , Network Traffic Management ,network intrusion detection and anomaly detection.

ML techniques benefits from this Dataset :

This dataset is highly useful because it consists of 394137 instances of network traffic data obtained by using the 25 applications on a public,private and Enterprise networks.Also,the dataset consists of very important features that can be used for most of the applications of Machine learning in cybersecurity.Here are few of the potential machine learning applications that could be benefited from this dataset are :

Network Performance Monitoring : This large network traffic data set can be utilised for analysing the network traffic to identifying the network patterns in the network .This help in designing the network security algorithms for minimise the network probelms.

Anamoly Detection : Large network traffic dataset can be utilised training the machine learning models for finding the irregularitues in the traffic which could help identify the cyber attacks.

3.Network Intrusion Detection : This large dataset could be utilised for machine algorithms training and designing the models for detection of the traffic issues,Malicious traffic network attacks and DOS attacks as well.

Facebook

Twitter

Click to copy link

Link copied

Cite

(2017). Most popular websites in the Netherlands 2015 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3edeb59b-b49b-59cb-9757-9127aed7e8af

Most popular websites in the Netherlands 2015 - Dataset - B2FIND

Explore at:

Dataset updated

Jun 2, 2017

Area covered

Netherlands

Description

This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as whitelist for the Newstracker Research project in which we monitored the online web behaviour of a group of respondents.The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (please find the code in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist with websites that were the most popular websites in 2015. We manually compiled this list by using data of DDMM, Alexa and own research. The dataset consists of 5 columns:- the URL- the type of website: We created a list of types of websites and each website has been manually labeled with 1 category- Nieuws-regio: When the category was 'News', we subdivided these websites in the regional focus: International, National or Local- Nieuws-onderwerp: Furthermore, each website under the category News was further subdivided in type of news website. For this we created an own list of news categories and manually coded each website- Bron: For each website we noted which source we used to find this website.The full description of the research design of the Newstracker including the set-up of this whitelist is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.

Clear search

Close search

Google apps

Main menu

Most popular websites in the Netherlands 2015 - Dataset - B2FIND

Online Sales Dataset - Popular Marketplace Data

Columns:

Insights:

(Dataset) The most visited health websites in the world

Most visited websites by hierachycal categories

Context

Content

Acknowledgements

Inspiration

Machine Learning Dataset

Dataset used for HTTPS traffic classification using packet burst statistics

Google Maps Dataset

G2 Dataset

Traces captured by visiting the top 1500 website

Dataset

Contents

LinkedIn Datasets

Reddit r/AskScience Flair Dataset

Dataset: A Systematic Literature Review on the topic of High-value datasets

The Pirate Bay 2008-12 Dataset - CSG

Openness: OPEN?

Dataset used for detecting DNS over HTTPS by Machine Learning.

PULSE dataset

Most Viewed Digital Records in City Archives Digital Repository

Alexa, International Top 100 Websites, Global, 10.12.2007

National Hydrography Data - NHD and 3DHP

Success.ai | B2B Company & Contact Data – 28M Verified Company Profiles -...

Network Traffic Dataset

Most popular websites in the Netherlands 2015 - Dataset - B2FINDSee More Versions

Most popular websites in the Netherlands 2015 - Dataset - B2FIND