Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Popular Website Traffic Over Time’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/popular-website-traffice on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Background
Have you ever been in a conversation where the question comes up: who uses Bing? The question arises occasionally because people wonder whether these sites get any views. In this research study, we explore traffic over time for many popular websites.
Methodology
The data collected originates from SimilarWeb.com.
Source
For the analysis and study, go to The Concept Center
This dataset was created by Chase Willden and contains around 0 samples along with 1/1/2017, Social Media, technical information and other features such as: - 12/1/2016 - 3/1/2017 - and more.
- Analyze 11/1/2016 in relation to 2/1/2017
- Study the influence of 4/1/2017 on 1/1/2017
- More datasets
If you use this dataset in your research, please credit Chase Willden
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for 1000 Website Screenshots with Metadata
Dataset Summary
Silatus is sharing, for free, a segment of a dataset that we are using to train a generative AI model for text-to-mockup conversions. This dataset was collected in December 2022 and early January 2023, so it contains very recent data from 1,000 of the world's most popular websites. You can get our larger 10,000 website dataset for free at: https://silatus.com/datasets This dataset includes: High-res… See the full description on the dataset page: https://huggingface.co/datasets/silatus/1k_Website_Screenshots_and_Metadata.
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.
For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (the code is available in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist of the websites that were the most popular in 2015. We compiled this list manually using data from DDMM and Alexa and our own research.
The dataset consists of 5 columns:
- the URL
- the type of website: we created a list of website types and manually labeled each website with 1 category
- Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
- Nieuws-onderwerp: each website in the category News was further subdivided by type of news website. For this we created our own list of news categories and manually coded each website
- Bron: for each website we noted which source we used to find it
The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
The "Phishing Data" dataset is a comprehensive collection of information specifically curated for analyzing and understanding phishing attacks. Phishing attacks involve malicious attempts to deceive individuals or organizations into disclosing sensitive information such as passwords or credit card details. This dataset comprises 18 distinct features that offer valuable insights into the characteristics of phishing attempts. These features include the URL of the website being analyzed, the length of the URL, the use of URL shortening services, the presence of the "@" symbol, the presence of redirection using "//", the presence of prefixes or suffixes in the URL, the number of subdomains, the usage of secure connection protocols (HTTPS), the length of time since domain registration, the presence of a favicon, the presence of HTTP or HTTPS tokens in the domain name, the URL of requested external resources, the presence of anchors in the URL, the number of hyperlinks in HTML tags, the server form handler used, the submission of data to email addresses, abnormal URL patterns, and estimated website traffic or popularity. Together, these features enable the analysis and detection of phishing attempts in the "Phishing Data" dataset, aiding in the development of models and algorithms to combat phishing attacks.
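Several of the features listed above are purely lexical and can be derived from the URL string alone. The following is a minimal sketch of how such features might be computed, not the dataset's own extraction code: the function name `url_features`, the feature keys, and the `SHORTENER_HOSTS` list are assumptions made for illustration, and the dataset's exact definitions may differ.

```python
from urllib.parse import urlparse

# Hypothetical list of URL-shortener hosts, for illustration only.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def url_features(url: str) -> dict:
    """Compute a few lexical URL features (illustrative definitions only)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Text after the scheme separator, used to look for a second "//".
    after_scheme = url.split("://", 1)[1] if "://" in url else url
    return {
        "url_length": len(url),
        "uses_shortener": host in SHORTENER_HOSTS,
        "has_at_symbol": "@" in url,
        "has_double_slash_redirect": "//" in after_scheme,
        # A hyphen in the host is treated here as a prefix/suffix marker.
        "has_hyphen_in_host": "-" in host,
        # Naive count: dots in the host minus one; ignores multi-part TLDs.
        "subdomain_count": max(host.count(".") - 1, 0),
        "uses_https": parsed.scheme == "https",
    }


if __name__ == "__main__":
    print(url_features("https://secure-login.example.com//redirect@evil"))
```

Features such as domain age, favicon presence, or estimated traffic need external lookups or the page's HTML and are outside the scope of this sketch.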
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset used for the online stats training website (https://www.rensvandeschoot.com/tutorials/) and is based on the data used by van de Schoot, van der Velden, Boom, and Brugman (2010).
The dataset is based on a study that investigates the association between popularity status and antisocial behavior among at-risk adolescents (n = 1491), with gender and ethnic background as moderators of the association. The study distinguished subgroups within the popular status group in terms of overt and covert antisocial behavior. For more information on the sample, instruments, methodology, and research context, we refer interested readers to van de Schoot, van der Velden, Boom, and Brugman (2010).
Variable name Description
Respnr = Respondents’ number
Dutch = Respondents’ ethnic background (0 = Dutch origin, 1 = non-Dutch origin)
gender = Respondents’ gender (0 = boys, 1 = girls)
sd = Adolescents’ socially desirable answering patterns
covert = Covert antisocial behavior
overt = Overt antisocial behavior
We'll extract any data from any website on the Internet. You don't have to worry about buying and maintaining complex and expensive software, or hiring developers.
Some common use cases our customers use the data for: • Data Analysis • Market Research • Price Monitoring • Sales Leads • Competitor Analysis • Recruitment
We can get data from websites with pagination or scroll, with captchas, and even from behind logins. Text, images, videos, documents.
Receive data in any format you need: Excel, CSV, JSON, or any other.
https://brightdata.com/license
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is available on Brisbane City Council’s open data website – data.brisbane.qld.gov.au. The site provides additional features for viewing and interacting with the data and for downloading the data in various formats.
Monthly analytics reports for the Brisbane City Council website
Information regarding the sessions for Brisbane City Council website during the month including page views and unique page views.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1000 of the world's top websites. They have been automatically annotated to label the following classes:
* button - navigation links, tabs, etc.
* heading - text that was enclosed in <h1> to <h6> tags.
* link - inline, textual <a> tags.
* label - text labeling form fields.
* text - all other text.
* image - <img>, <svg>, or <video> tags, and icons.
* iframe - ads and 3rd party content.
This is an example image and annotation from the dataset:
https://i.imgur.com/mOG3u3Z.png (Wikipedia screenshot)
Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.
Roboflow is happy to provide a custom screenshots dataset to meet your particular needs. We can crawl public or internal web applications. Just reach out and we'll be happy to provide a quote!
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless. Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.
https://brightdata.com/license
Use our G2 dataset to collect product descriptions, ratings, reviews, and pricing information from the world's largest tech marketplace. You may purchase a full or partial dataset depending on your business needs. The G2 Software Products Dataset, with a focus on top-rated products, serves as a valuable resource for software buyers, businesses, and technology enthusiasts. This use case highlights products that have received exceptional ratings and positive reviews on the G2 platform, offering insights into customer satisfaction and popularity. For software buyers, this dataset acts as a trusted guide, presenting a curated selection of G2's top-rated software products and raising the likelihood of satisfaction with purchases. Businesses and technology professionals can use this dataset to identify popular and well-reviewed software solutions, optimizing their decision-making process. This use case emphasizes the dataset's utility for those specifically interested in exploring and acquiring top-rated software products from G2.
Product Overview
The G2 software products and reviews dataset offers a detailed and thorough overview of leading software companies. The dataset includes all major data points:
- Product descriptions
- Average rating (1-5)
- Sellers number of reviews
- Key features (highest and lowest rated)
- Competitors
- Website & social media links
and more.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises 2,271 entries and provides insights into user interface (UI) and user experience (UX) preferences across various digital platforms. Key information includes user demographics (Name, Age, Gender) and platform preferences (e.g., Twitter, YouTube, Facebook, Website). It captures user experiences and satisfaction levels with various UI/UX elements such as color schemes, visual hierarchy, typography, multimedia usage, and layout design. The dataset also includes evaluations of mobile responsiveness, call-to-action buttons, form usability, feedback/error messages, loading speed, personalization, accessibility, and interactions (like scrolling behavior and gestures). Each UI/UX component is rated on a scale, allowing for quantitative analysis of user preferences and experiences, making this dataset valuable for research in user-centered design and usability optimization.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The authors, based in the Czech Republic, are publishing a dataset for HTTPS traffic classification.
Since the data were captured mainly in a real backbone network, they omitted IP addresses and ports. The datasets consist of data calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During research, they divided HTTPS into five categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, W -- Website, and other traffic.
They chose the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. They also used several popular websites that primarily target the audience in the Czech Republic. The identified traffic classes and their representatives are provided below:
Live Video Stream: Twitch, Czech TV, YouTube Live
Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
Music Player: AppleMusic, Spotify, SoundCloud
File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
Website and Other Traffic: Websites from Alexa Top 1M list
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are publishing a dataset we created for the HTTPS traffic classification.
Since the data were captured mainly in a real backbone network, we omitted IP addresses and ports. The datasets consist of data calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During our research, we divided HTTPS into five categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, W -- Website, and other traffic.
We have chosen the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:
Live Video Stream: Twitch, Czech TV, YouTube Live
Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
Music Player: AppleMusic, Spotify, SoundCloud
File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
Website and Other Traffic: Websites from Alexa Top 1M list
The DAPR extension for CKAN integrates with the Digital Analytics Program (DAP) to retrieve and display usage statistics for datasets and resources within a CKAN instance. This extension allows administrators to monitor how frequently datasets are accessed, providing valuable insights into data usage patterns. By tracking download events, DAPR enriches CKAN's functionality, facilitating data-driven decision-making and resource management.
Key Features:
- DAP Download Event Retrieval: Retrieves download events tracked by the Digital Analytics Program and stores them within CKAN for later analysis and display. This ensures that access data is captured and made available alongside other dataset metadata.
- Frequently Accessed Dataset Listing: Enables the creation of lists showcasing the most frequently accessed datasets. This allows administrators to identify popular datasets and prioritize resources accordingly.
- Dataset and Resource Access Counting: Provides a mechanism to display access counts for datasets and individual resources. These counts can be displayed within the CKAN interface, giving users immediate feedback on dataset popularity.
- DAP-enabled Website Event Tracking: Tracks accesses to resources even when those accesses originate from external DAP-enabled websites, providing a comprehensive view of data usage regardless of the access point.
- Scheduled Data Refresh: Supports a command-line utility for scheduled access data imports, so usage statistics stay up to date with minimal manual intervention. It has options to specify the start and end date or to retrieve all records.
Technical Integration: The DAPR extension integrates with CKAN by adding new database tables to store DAP tracking data. Administrators configure the extension through the CKAN configuration file (ckan.ini) or similar. The extension also provides a command-line interface (CLI) tool for importing DAP tracking events, which can be scheduled using cron or similar task schedulers.
Benefits & Impact: By integrating DAP statistics, the DAPR extension allows CKAN instance owners to improve data visibility, assess the value of available datasets, and make data-informed decisions about resource allocation and data discoverability improvements. Knowing which datasets are frequently used and accessed can help data curators prioritize updates, augment popular datasets with further information, and overall invest resources into ensuring that the CKAN instance delivers the most relevant and impactful data to its audience.
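The access counting and "most frequently accessed" listing described above come down to tallying download events per dataset, optionally restricted to a date range. The sketch below illustrates that idea only; it is not the extension's code or database schema. The event record shape (`dataset_id`, `date`) and the function name `most_accessed` are hypothetical.

```python
from collections import Counter
from datetime import date


def most_accessed(events, start=None, end=None, limit=10):
    """Return (dataset_id, count) pairs for the most accessed datasets.

    `events` is an iterable of dicts with hypothetical keys
    "dataset_id" and "date"; `start` and `end` are optional inclusive bounds.
    """
    counts = Counter()
    for event in events:
        day = event["date"]
        if start is not None and day < start:
            continue
        if end is not None and day > end:
            continue
        counts[event["dataset_id"]] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = [
        {"dataset_id": "air-quality", "date": date(2024, 1, 3)},
        {"dataset_id": "air-quality", "date": date(2024, 1, 9)},
        {"dataset_id": "budget", "date": date(2024, 2, 1)},
    ]
    print(most_accessed(sample, start=date(2024, 1, 1), end=date(2024, 1, 31)))
```

Leaving both bounds unset counts every record, which mirrors the option of retrieving all records instead of a start and end date.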
A. Market Research and Analysis: Utilize the Tripadvisor dataset to conduct in-depth market research and analysis in the travel and hospitality industry. Identify emerging trends, popular destinations, and customer preferences. Gain a competitive edge by understanding your target audience's needs and expectations.
B. Competitor Analysis: Compare and contrast your hotel or travel services with competitors on Tripadvisor. Analyze their ratings, customer reviews, and performance metrics to identify strengths and weaknesses. Use these insights to enhance your offerings and stand out in the market.
C. Reputation Management: Monitor and manage your hotel's online reputation effectively. Track and analyze customer reviews and ratings on Tripadvisor to identify improvement areas and promptly address negative feedback. Positive reviews can be leveraged for marketing and branding purposes.
D. Pricing and Revenue Optimization: Leverage the Tripadvisor dataset to analyze pricing strategies and revenue trends in the hospitality sector. Understand seasonal demand fluctuations, pricing patterns, and revenue optimization opportunities to maximize your hotel's profitability.
E. Customer Sentiment Analysis: Conduct sentiment analysis on Tripadvisor reviews to gauge customer satisfaction and sentiment towards your hotel or travel service. Use this information to improve guest experiences, address pain points, and enhance overall customer satisfaction.
F. Content Marketing and SEO: Create compelling content for your hotel or travel website based on the popular keywords, topics, and interests identified in the Tripadvisor dataset. Optimize your content to improve search engine rankings and attract more potential guests.
G. Personalized Marketing Campaigns: Use the data to segment your target audience based on preferences, travel habits, and demographics. Develop personalized marketing campaigns that resonate with different customer segments, resulting in higher engagement and conversions.
H. Investment and Expansion Decisions: Access historical and real-time data on hotel performance and market dynamics from Tripadvisor. Utilize this information to make data-driven investment decisions, identify potential areas for expansion, and assess the feasibility of new ventures.
I. Predictive Analytics: Utilize the dataset to build predictive models that forecast future trends in the travel industry. Anticipate demand fluctuations, understand customer behavior, and make proactive decisions to stay ahead of the competition.
J. Business Intelligence Dashboards: Create interactive and insightful dashboards that visualize key performance metrics from the Tripadvisor dataset. These dashboards can help executives and stakeholders get a quick overview of the hotel's performance and make data-driven decisions.
Incorporating the Tripadvisor dataset into your business processes will enhance your understanding of the travel market, facilitate data-driven decision-making, and provide valuable insights to drive success in the competitive hospitality industry.
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
https://brightdata.com/license
Real estate datasets from various websites cover all major real estate data points including: property type, size, location, price, bedrooms, baths, address, history, images, and much more. Popular use cases include: forecast housing demand, analyze price fluctuations, improve customer satisfaction, see past prices to monitor market trends, and more.
The review corpus used here is a collection of Goodreads book reviews obtained from the Kaggle website. Originally, it consisted of around 10,000 reviews of the top 100 science fiction books (ranked by Goodreads ratings). On examination, however, we found that a significant number of reviews had issues, including missing values for the review text and like count, as well as duplicate sentences. We removed duplicate sentences from the reviews and excluded problematic reviews with missing fields. After clean-up, the corpus consists of 2259 popular and 2555 non-popular reviews.
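The two clean-up steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' actual pipeline: the field names `text` and `likes`, the function names, and the naive punctuation-based sentence splitting are all hypothetical choices.

```python
import re


def drop_duplicate_sentences(text):
    """Keep only the first occurrence of each sentence, preserving order."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return " ".join(kept)


def clean_reviews(reviews):
    """Drop reviews with missing fields, then de-duplicate their sentences."""
    cleaned = []
    for review in reviews:
        text = review.get("text")
        likes = review.get("likes")
        # Exclude reviews whose text or like count is missing.
        if not text or not text.strip() or likes is None:
            continue
        cleaned.append({"text": drop_duplicate_sentences(text), "likes": likes})
    return cleaned


if __name__ == "__main__":
    raw = [
        {"text": "Great book. Great book. Loved it!", "likes": 12},
        {"text": None, "likes": 3},
    ]
    print(clean_reviews(raw))
```

Duplicates are matched case-insensitively here; a real pipeline might prefer a proper sentence tokenizer over the regular expression.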
https://brightdata.com/license
Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions.
Dataset Features
- Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month.
- Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records.
- Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.
Customizable Subsets for Specific Needs
Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications.
Popular Use Cases
- Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data.
- Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities.
- Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies.
- Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis.
- AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.
Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains two files created for the dissertation "A Social Media Tool for Domain-Specific Information Retrieval - A Case Study in Human Trafficking" by Tito Griné for the Master in Informatics and Computing Engineering from the Faculty of Engineering of the University of Porto (FEUP). Both files were built between 02/03/2022 and 09/03/2022.
The file "Topic profile dataset" includes Twitter profiles, identified by their handle, each associated with a topic to which it is highly related. These were gathered by first selecting specific topics and finding lists of famous people within them. Afterward, the Twitter API was used to search for profiles using the names from the lists. The first profile returned for each search was manually analyzed to corroborate its relation to the topic before it was kept. This dataset was used to evaluate the performance of an agnostic classifier designed to identify Twitter profiles related to a given topic. The topic was given as a set of keywords that were highly related to the desired topic. The file contains 271 pairs of topics and Twitter profile handles. The profiles span six different topics: Ambient Music (102 profiles); Climate Activism (69 profiles); Quantum Information (9 profiles); Contemporary Art (26 profiles); Tennis (52 profiles); and Information Retrieval (13 profiles). At the time this dataset was created, all Twitter handles were from publicly visible profiles.
The file "Profile-website dataset" includes Twitter profiles, identified by their handle, linked to URLs of websites related to the entities behind the profiles. The starting list of Twitter handles was taken from the profiles in "topic-profile_dataset.csv". The links in each profile's description were gathered using the Twitter API, and each was manually crawled to assess its relatedness to the profile from which it was taken.
This dataset helped evaluate the efficacy of an algorithm developed to classify websites as related or unrelated to a given Twitter profile. From the initial list of 271 profiles, at least one related link was found for 196 of them. The remaining 75 were not included in this dataset. Hence, the dataset contains 196 unique Twitter handles and 325 distinct pairs of Twitter handles and corresponding URLs, since a handle appears in more than one row when multiple URLs are related to it.