100+ datasets found

Popular Website Screenshots and Metadata
kaggle.com
zip
Updated Jan 6, 2023
Cite
Christopher Pratt (2023). Popular Website Screenshots and Metadata [Dataset]. https://www.kaggle.com/datasets/christopherpratt/popular-website-screenshots-and-metadata
Explore at:
Available download formats: zip (1273641347 bytes)
Dataset updated
Jan 6, 2023
Authors
Christopher Pratt
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Silatus is sharing, for free, a segment of a dataset that we are using to train a generative AI model for text-to-mockup conversions. This dataset was collected in December 2022 and early January 2023, so it contains very recent data from 1,000 of the world's most popular websites. You can get our larger 10,000 website dataset for free at: https://silatus.com/datasets

This dataset includes:

High-res screenshots: 1024x1024px, loaded JavaScript, loaded images

Text metadata: site title, navbar content, full page text data, page description

Visual metadata: absolute and relative positions of content (images, videos, inputs, buttons), color profile, base font
Popular Websites (edited) Dataset
universe.roboflow.com
zip
Updated Aug 2, 2025
Cite
Bro (2025). Popular Websites (edited) Dataset [Dataset]. https://universe.roboflow.com/bro-klhic/popular-websites-edited
Explore at:
Available download formats: zip
Dataset updated
Aug 2, 2025
Dataset authored and provided by
Bro
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Popular Webpages Bounding Boxes
Description
Popular Websites (edited)

## Overview
Popular Websites (edited) is a dataset for object detection tasks. It contains Popular Webpages annotations for 552 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Popular websites across the globe
kaggle.com
zip
Updated May 27, 2017
Cite
bpali26 (2017). Popular websites across the globe [Dataset]. https://www.kaggle.com/bpali26/popular-websites-across-the-globe
Explore at:
Available download formats: zip (639485 bytes)
Dataset updated
May 27, 2017
Authors
bpali26
Description
Context

This dataset includes some basic information about the websites we use daily. While scraping this information, I learned quite a lot about R programming, system speed, memory usage, etc., and developed my niche in web scraping. Scraping the data took about 4-5 hours on my system (4 GB RAM), and working out the idea for this project took about 4-5 days.

Content

The dataset contains the top 50 ranked sites from each of 191 countries, along with their global traffic rank. Here, country_rank represents the traffic rank of a site within the country, and traffic_rank represents the global traffic rank of that site.

Since the meaning of most columns can be derived from their names, the dataset is fairly straightforward to understand. However, a few points may cause confusion, which I would like to explain here:

1) Most of the numeric values are in character format and therefore contain spaces, which you may need to clean.

2) There are multiple instances of the same website. For example, Yahoo.com is present in 179 rows within this dataset, because it has a different country rank in each country.

3) The information provided in this dataset is for the top 50 websites in 191 countries as of 25th May 2017 and is subject to change in the future due to the dynamic nature of the rankings.

4) The dataset actually contains 9540 rows instead of 9550 (50*191 rows), because information was unavailable for 10 websites.
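
As a rough illustration of points 1 and 4, the cleanup and the row-count arithmetic might look like the Python sketch below. The helper name `clean_numeric` is hypothetical, and treating commas as thousands separators is an assumption; the description only says the values contain spaces.

```python
import re


def clean_numeric(value):
    """Convert a character-typed number to float.

    Removes all whitespace and (as an assumption) comma thousands
    separators. Returns None for empty or unparseable input.
    """
    if value is None:
        return None
    text = re.sub(r"[\s,]", "", str(value))
    if text == "":
        return None
    try:
        return float(text)
    except ValueError:
        return None


# Row-count check from point 4: 50 sites for each of 191 countries.
EXPECTED_ROWS = 50 * 191      # 9550
ACTUAL_ROWS = 9540            # as stated in the description
MISSING_ROWS = EXPECTED_ROWS - ACTUAL_ROWS
```

Applying `clean_numeric` to each character column before analysis would be one way to handle the embedded spaces; values that cannot be parsed come back as `None` rather than raising.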

PS: If there are any more queries, comment on this dataset and I'll add an answer to the list above.

Acknowledgements

I couldn't have done this without the help of others. I scraped this information from publicly available (open to all) websites, namely: 1) http://data.danetsoft.com/ 2) http://www.alexa.com/topsites, for which I'm highly grateful. I truly appreciate and thank the owners of these sites for providing the information that I included in this dataset.

Inspiration

I feel that there is a lot of scope for exploring and visualizing this dataset to find trends in the attributes of these websites across countries. One could also try predicting the global traffic rank as a variable dependent on the other attributes of the website. In any case, this dataset will help you find the popular sites in your area.
Popular Websites (augmented + Nonedited) Dataset
universe.roboflow.com
zip
Updated Aug 2, 2025
Cite
Bro (2025). Popular Websites (augmented + Nonedited) Dataset [Dataset]. https://universe.roboflow.com/bro-klhic/popular-websites-augmented-nonedited/dataset/1
Explore at:
Available download formats: zip
Dataset updated
Aug 2, 2025
Dataset authored and provided by
Bro
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Ui Elements Bounding Boxes
Description
Popular Websites (augmented + Nonedited)

## Overview
Popular Websites (augmented + Nonedited) is a dataset for object detection tasks. It contains Ui Elements annotations for 602 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Dataset Search WebApp
figshare.com
zip
Updated May 31, 2023
Cite
Angelo Batista Neves Júnior; Luiz André Portes Paes Leme (2023). Dataset Search WebApp [Dataset]. http://doi.org/10.6084/m9.figshare.5217958.v2
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.5217958.v2
Dataset updated
May 31, 2023
Dataset provided by
figshare: http://figshare.com/
Authors
Angelo Batista Neves Júnior; Luiz André Portes Paes Leme
License
https://www.gnu.org/copyleft/gpl.html
Description
Even though extensive lists of open datasets are available in catalogues, most data publishers still connect their datasets to other popular datasets, such as DBpedia, Freebase and Geonames. Although the linkage with popular datasets would allow us to explore external resources, it would fail to cover highly specialized information. Catalogues of linked data describe the content of datasets in terms of the update periodicity, authors, SPARQL endpoints, linksets with other datasets, amongst others, as recommended by the W3C VoID Vocabulary. However, catalogues by themselves do not provide any explicit information to help the URI linkage process. Searching techniques can rank available datasets SI according to the probability that it will be possible to define links between URIs of SI and a given dataset T to be published, so that most of the links, if not all, could be found by inspecting the most relevant datasets in the ranking. dataset-search is a tool for searching datasets for linkage.
Dataset used for HTTPS traffic classification using packet burst statistics
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Apr 11, 2022
Cite
Tropkova Zdena; Hynek Karel; Cejka Tomas (2022). Dataset used for HTTPS traffic classification using packet burst statistics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4911550
Explore at:
Dataset updated
Apr 11, 2022
Dataset provided by
CESNET: http://www.cesnet.cz/
FIT CTU
Authors
Tropkova Zdena; Hynek Karel; Cejka Tomas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We are publishing a dataset we created for the HTTPS traffic classification.

Since the data were captured mainly in a real backbone network, we omitted IP addresses and ports. The datasets consist of statistics calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.

During our research, we divided HTTPS into five categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, W -- Website, and other traffic.

We have chosen the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:

Live Video Stream: Twitch, Czech TV, YouTube Live

Video Player: DailyMotion, Stream.cz, Vimeo, YouTube

Music Player: AppleMusic, Spotify, SoundCloud

File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive

Website and Other Traffic: Websites from Alexa Top 1M list
Corporate Website — Analytics — Popular pages
data.brisbane.qld.gov.au
csv, excel, json
Updated Feb 20, 2026
Cite
(2026). Corporate Website — Analytics — Popular pages [Dataset]. https://data.brisbane.qld.gov.au/explore/dataset/corporate-website-analytics-popular-pages/
Explore at:
Available download formats: json, excel, csv
Dataset updated
Feb 20, 2026
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Monthly analytics reports for the Brisbane City Council website

Information regarding the sessions for Brisbane City Council website during the month including page views and unique page views.
Popularity Dataset for Online Stats Training
data-staging.niaid.nih.gov
data.niaid.nih.gov
Updated Aug 25, 2020
Cite
Rens van de Schoot (2020). Popularity Dataset for Online Stats Training [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3962122
Explore at:
Dataset updated
Aug 25, 2020
Dataset authored and provided by
Rens van de Schoot
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a dataset used for the online stats training website (https://www.rensvandeschoot.com/tutorials/) and is based on the data used by van de Schoot, van der Velden, Boom, and Brugman (2010).

The dataset is based on a study that investigates the association between popularity status and antisocial behavior in at-risk adolescents (n = 1491), with gender and ethnic background as moderators of the association. The study distinguished subgroups within the popular status group in terms of overt and covert antisocial behavior. For more information on the sample, instruments, methodology, and research context, we refer interested readers to van de Schoot, van der Velden, Boom, and Brugman (2010).

Variable name Description

Respnr = Respondents’ number

Dutch = Respondents’ ethnic background (0 = Dutch origin, 1 = non-Dutch origin)

gender = Respondents’ gender (0 = boys, 1 = girls)

sd = Adolescents’ socially desirable answering patterns

covert = Covert antisocial behavior

overt = Overt antisocial behavior
Netflix
ieee-dataport.org
Updated Oct 1, 2021
Cite
Danil Shamsimukhametov (2021). Netflix [Dataset]. https://ieee-dataport.org/documents/youtube-netflix-web-dataset-encrypted-traffic-classification
Explore at:
Dataset updated
Oct 1, 2021
Authors
Danil Shamsimukhametov
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
YouTube
Description
YouTube flows
Data from: SANAD: Single-Label Arabic News Articles Dataset for Automatic...
data.mendeley.com
Updated Sep 2, 2019
Cite
Omar Einea (2019). SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization [Dataset]. http://doi.org/10.17632/57zpx667y9.2
Explore at:
Unique identifier
https://doi.org/10.17632/57zpx667y9.2
Dataset updated
Sep 2, 2019
Authors
Omar Einea
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SANAD Dataset is a large collection of Arabic news articles that can be used in different Arabic NLP tasks such as Text Classification and Word Embedding. The articles were collected using Python scripts written specifically for three popular news websites: AlKhaleej, AlArabiya and Akhbarona.

All datasets have seven categories [Culture, Finance, Medical, Politics, Religion, Sports and Tech], except AlArabiya which doesn’t have [Religion]. SANAD contains a total number of 190k+ articles.

How to use it:

Unzip compressed resources.

Each folder contains 6-7 sub-folders which are labeled by the category's name.

Each sub-folder contains a set of article files corresponding to its category.

SANAD_SUBSET is a balanced benchmark dataset (from SANAD) that is used in our research work. It contains the training (90%) and testing (10%) sets.

How to use it:

Unzip the compressed file.

There are 3 main folders containing the 3 datasets: Akhbarona, Khaleej, and Arabiya.

Each dataset-folder contains 2 sub-folders: training and testing.

The training and testing folders include the balanced categories sub-folders.
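
The SANAD_SUBSET folder layout described above (dataset folder, then a split folder, then one sub-folder per category) could be read into (text, label) pairs along the lines of the Python sketch below. The function name `load_split`, the exact folder names passed to it, and the UTF-8 encoding are assumptions for illustration; check them against the unzipped files.

```python
from pathlib import Path


def load_split(dataset_dir, split):
    """Return a list of (article_text, category) pairs for one split.

    Expects dataset_dir/split/<category>/<article files>, where the
    category sub-folder name is used as the label.
    """
    samples = []
    split_dir = Path(dataset_dir) / split
    # Sort for a deterministic order across platforms.
    for category_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        for article in sorted(category_dir.iterdir()):
            if article.is_file():
                text = article.read_text(encoding="utf-8")  # encoding is an assumption
                samples.append((text, category_dir.name))
    return samples
```

For example, `load_split("Akhbarona", "training")` would return the training pairs for one of the three datasets, provided the split folder is actually named `training` on disk.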
National Center for Education Statistics Common Core of Data
datalumos.org
Updated Mar 4, 2025
Cite
United States Department of Education. Institute of Education Sciences. National Center for Education Statistics (2025). National Center for Education Statistics Common Core of Data [Dataset]. http://doi.org/10.3886/E221563V1
Explore at:
Unique identifier
https://doi.org/10.3886/E221563V1
Dataset updated
Mar 4, 2025
Dataset provided by
National Center for Education Statistics: https://nces.ed.gov/
Institute of Education Sciences: http://ies.ed.gov/
United States Department of Education: https://ed.gov/
Authors
United States Department of Education. Institute of Education Sciences. National Center for Education Statistics
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Includes data files and supplemental information. Supplemental information includes a reproducible RMarkdown file, an Excel sheet with metadata, and complete webpage files. Please note that CCD nonfiscal documentation files have been downloaded manually.

From the Common Core of Data website:

The Common Core of Data (CCD) is the Department of Education's primary database on public elementary and secondary education in the United States. CCD is a comprehensive, annual, national database of all public elementary and secondary schools and school districts.

Information on the Common Core of Data (CCD)

The primary purpose of the CCD is to provide basic information on public elementary and secondary schools, local education agencies (LEAs), and state education agencies (SEAs) for each state, the District of Columbia, and the outlying territories with a U.S. relationship. CCD is composed of two components: Nonfiscal CCD and Fiscal CCD.
Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Jun 19, 2024
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
Available download formats: .json, .csv, .xlsx
Dataset updated
Jun 19, 2024
Dataset authored and provided by
Bright Data
License
https://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
UI/UX user interaction dataset across popular digital platforms
data.mendeley.com
Updated Nov 19, 2024
Cite
Md Atikur Rahman (2024). UI/UX user interaction dataset across popular digital platforms [Dataset]. http://doi.org/10.17632/dxthxmnkhx.6
Explore at:
Unique identifier
https://doi.org/10.17632/dxthxmnkhx.6
Dataset updated
Nov 19, 2024
Authors
Md Atikur Rahman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset comprises 2,271 entries and provides insights into user interface (UI) and user experience (UX) preferences across various digital platforms. Key information includes user demographics (Name, Age, Gender) and platform preferences (e.g., Twitter, YouTube, Facebook, Website). It captures user experiences and satisfaction levels with various UI/UX elements such as color schemes, visual hierarchy, typography, multimedia usage, and layout design. The dataset also includes evaluations of mobile responsiveness, call-to-action buttons, form usability, feedback/error messages, loading speed, personalization, accessibility, and interactions (like scrolling behavior and gestures). Each UI/UX component is rated on a scale, allowing for quantitative analysis of user preferences and experiences, making this dataset valuable for research in user-centered design and usability optimization.
ClaimsKG - A Knowledge Graph of Fact-Checked Claims (January, 2023)
search.gesis.org
datacatalogue.cessda.eu
Updated Jan 20, 2026
Cite
Gangopadhyay, Susmita; Schellhammer, Sebastian; Boland, Katarina; Schüller, Sascha; Todorov, Konstantin; Tchechmedjiev, Andon; Zapilko, Benjamin; Fafalios, Pavlos; Jabeen, Hajira; Dietze, Stefan (2026). ClaimsKG - A Knowledge Graph of Fact-Checked Claims (January, 2023) [Dataset]. http://doi.org/10.7802/2620
Explore at:
Unique identifier
https://doi.org/10.7802/2620
Dataset updated
Jan 20, 2026
Dataset provided by
GESIS, Köln
GESIS search
Authors
Gangopadhyay, Susmita; Schellhammer, Sebastian; Boland, Katarina; Schüller, Sascha; Todorov, Konstantin; Tchechmedjiev, Andon; Zapilko, Benjamin; Fafalios, Pavlos; Jabeen, Hajira; Dietze, Stefan
License
https://www.gesis.org/en/institute/data-usage-terms
Description
ClaimsKG is a knowledge graph of metadata information for fact-checked claims scraped from popular fact-checking sites. In addition to providing a single dataset of claims and associated metadata, truth ratings are harmonized and additional information is provided for each claim, e.g., about mentioned entities. Please see (https://data.gesis.org/claimskg/) for further details about the data model, query examples and statistics.

The dataset facilitates structured queries about claims, their truth values, involved entities, authors, dates, and other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline, which harvests claim-related data from popular fact-checking web sites, annotates them with related entities from DBpedia/Wikipedia, and lifts all data to RDF using established vocabularies (such as schema.org).

The latest release of ClaimsKG covers 74066 claims and 72127 Claim Reviews. This is the fourth release of the dataset where data was scraped till Jan 31, 2023 containing claims published between 1996 and 2023 from 13 fact-checking websites. The websites are Fullfact, Politifact, TruthOrFiction, Checkyourfact, Vishvanews, AFP (French), AFP, Polygraph, EU factcheck, Factograph, Fatabyyano, Snopes and Africacheck. The claim-review (fact-checking) period for claims ranges between the year 1996 to 2023. Similar to the previous release, the Entity fishing python client (https://github.com/hirmeos/entity-fishing-client-python) has been used for entity linking and disambiguation in this release. Improvements have been made in the web scraping and data preprocessing pipeline to extract more entities from both claims and claims reviews. Currently, ClaimsKG contains 3408386 entities detected and referenced with DBpedia.

This latest release of ClaimsKG supersedes the previous versions, as it contains all the claims from the previous versions together with additional new claims, as well as improved entity annotation resulting in a higher number of entities.
Which social media platforms are most popular
pewresearch.org
csv
Updated Feb 2, 2026
Cite
Pew Research Center (2026). Which social media platforms are most popular [Dataset]. https://www.pewresearch.org/internet/fact-sheet/social-media/
Explore at:
Available download formats: csv
Dataset updated
Feb 2, 2026
Dataset authored and provided by
Pew Research Center: http://pewresearch.org/
License
https://www.pewresearch.org/terms-and-conditions/
Description
A line chart that shows % of U.S. adults who say they ever use …
VRBO Dataset
rebrowser.net
csv, json
Updated Mar 18, 2026
Cite
Rebrowser (2026). VRBO Dataset [Dataset]. https://rebrowser.net/products/datasets/vrbo
Explore at:
Available download formats: csv, json
Dataset updated
Mar 18, 2026
Dataset authored and provided by
Rebrowser
License
https://rebrowser.com/pricing
Description
Access comprehensive datasets containing millions of VRBO vacation rental records with historical pricing, occupancy patterns, and owner performance data. Our curated datasets cover popular vacation destinations with detailed property specifications and guest feedback analytics. Accelerate your vacation rental market research with pre-processed property data eliminating the need for complex scraping infrastructure. Ideal for real estate analysts, academic researchers, and investment firms requiring large-scale vacation rental market intelligence.
Web tracking data for 500 websites popular among Finnish web users
zenodo.org
Updated Apr 18, 2020
Cite
John Bailey; Mikael Laakso; Linus Nyman (2020). Web tracking data for 500 websites popular among Finnish web users [Dataset]. http://doi.org/10.5281/zenodo.3543444
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3543444
Dataset updated
Apr 18, 2020
Dataset provided by
Zenodo: http://zenodo.org/
Authors
John Bailey; Mikael Laakso; Linus Nyman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes observations of trackers present on the top 500 pages popular among Finnish web users as per Alexa. The data collection was conducted using TrackerTracker in five separate requests for five subsets of 100 sites each between 19.8.2017 and 20.8.2017. The tool used a tracker database from March 24, 2017. More methodology details are described in the associated journal article https://doi.org/10.23978/inf.87841
Data from: E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects...
zenodo.org
bin, txt
Updated May 20, 2025
Cite
Sergio Di Meglio; Valeria Pontillo; Coen De Roover; Luigi Libero Lucio Starace; Sergio Di Martino; Ruben Opdebeeck (2025). E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects [Dataset]. http://doi.org/10.5281/zenodo.14221860
Explore at:
Available download formats: txt, bin
Unique identifier
https://doi.org/10.5281/zenodo.14221860
Dataset updated
May 20, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Sergio Di Meglio; Valeria Pontillo; Coen De Roover; Luigi Libero Lucio Starace; Sergio Di Martino; Ruben Opdebeeck
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT
End-to-End (E2E) testing is a comprehensive approach to validating the functionality of a software application by testing its entire workflow from the user’s perspective, ensuring that all integrated components work together as expected. It is crucial for ensuring the quality and reliability of applications, especially in the web domain, which is often bound by Service Level Agreements (SLAs). This testing involves two key activities: Graphical User Interface (GUI) testing, which simulates user interactions through browsers, and performance testing, which evaluates system workload handling. Despite its importance, E2E testing is often neglected, and the lack of reliable datasets for Web GUI and performance testing has slowed research progress. This paper addresses these limitations by constructing E2EGit, a comprehensive dataset cataloging non-trivial open-source web projects on GITHUB that adopt GUI or performance testing.
The dataset construction process involved analyzing over 5k non-trivial web repositories based on popular programming languages (JAVA, JAVASCRIPT, TYPESCRIPT, PYTHON) to identify: 1) GUI tests based on popular browser automation frameworks (SELENIUM, PLAYWRIGHT, CYPRESS, PUPPETEER), 2) performance tests written with the most popular open-source tools (JMETER, LOCUST). After analysis, we identified 472 repositories using web GUI testing, with over 43,000 tests, and 84 repositories using performance testing, with 410 tests.

DATASET DESCRIPTION
The dataset is provided as an SQLite database, whose structure is illustrated in Figure 3 (in the paper) and consists of five tables, each serving a specific purpose.

The repository table contains information on 1.5 million repositories collected using the SEART tool on May 4. It includes 34 fields detailing repository characteristics.

The non_trivial_repository table is a subset of the previous one, listing repositories that passed the two filtering stages described in the pipeline. For each repository, it specifies whether it is a web repository using JAVA, JAVASCRIPT, TYPESCRIPT, or PYTHON frameworks. A repository may use multiple frameworks, with corresponding fields (e.g., is web java) set to true, and the field web dependencies listing the detected web frameworks.

For Web GUI testing, the dataset includes two additional tables: gui_testing_test_details, where each row represents a test file, providing the file path, the browser automation framework used, the test engine employed, and the number of tests implemented in the file; and gui_testing_repo_details, which aggregates data from the previous table at the repository level. Each of the 472 repositories has a row summarizing the number of test files using frameworks like SELENIUM or PLAYWRIGHT, test engines like JUNIT, and the total number of tests identified.

For performance testing, the performance_testing_test_details table contains 410 rows, one for each test identified. Each row includes the file path, whether the test uses JMETER or LOCUST, and extracted details such as the number of thread groups, concurrent users, and requests. Notably, some fields may be absent, for instance if external files (e.g., CSVs defining workloads) were unavailable, or in the case of Locust tests, where parameters like duration and concurrent users are specified via the command line.
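
Since the dataset ships as an SQLite database, a first sanity check could be to list its tables and their row counts, as in the Python sketch below. The helper name `table_row_counts` and any file name passed to it are hypothetical; the sketch deliberately avoids column names, because the description does not spell them out.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in an SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists all schema objects; skip SQLite's internal tables.
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

If the file matches the description, the result should include performance_testing_test_details with 410 rows and gui_testing_repo_details with 472 rows.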

To cite this article refer to this citation:

@inproceedings{di2025e2egit,
title={E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects},
author={Di Meglio, Sergio and Starace, Luigi Libero Lucio and Pontillo, Valeria and Opdebeeck, Ruben and De Roover, Coen and Di Martino, Sergio},
booktitle={2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR)},
pages={10--15},
year={2025},
organization={IEEE/ACM}
}

This work has been partially supported by the Italian PNRR MUR project PE0000013-FAIR.
Ultimate Arabic News Dataset
data.mendeley.com
opendatalab.com
Updated Jul 4, 2022
Cite
Ahmed Hashim Al-Dulaimi (2022). Ultimate Arabic News Dataset [Dataset]. http://doi.org/10.17632/jz56k5wxz7.2
Explore at:
Unique identifier
https://doi.org/10.17632/jz56k5wxz7.2
Dataset updated
Jul 4, 2022
Authors
Ahmed Hashim Al-Dulaimi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles.

Arabic news data was collected by web scraping techniques from many famous news sites such as Al-Arabiya, Al-Youm Al-Sabea (Youm7), the news published on the Google search engine and other various sources.

The data we collected consists of two primary files:

UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification.

UltimateArabicPrePros: A file that contains the data from the first file after pre-processing, which reduced it to about 188,000 text documents. Stop words, non-Arabic words, symbols and numbers have been removed so that this file is ready for direct use in various Arabic natural language processing tasks, such as text classification.

We have added two folders containing additional detailed datasets:

1- Sample: This folder contains samples of the results of web-scraping techniques for two popular Arab websites in two different news categories, Sports and Politics. This folder contains two datasets:

Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website. Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.

2- Dataset Versions: This folder contains four different versions of the original dataset, from which the appropriate version can be selected for use in text classification:

Original: The raw data without any pre-processing, so the number of tokens is very high.
Original_without_Stop: The data was cleaned by removing symbols, numbers, non-Arabic words, and stop words, so the number of tokens is greatly reduced.
Original_with_Stem: The data was cleaned, and text stemming was applied to strip affixes such as suffixes that might affect the accuracy of the results and to obtain the word roots.
Original_Without_Stop_Stem: All pre-processing techniques (data cleaning, stop word removal, and text stemming) were applied, so this version has the lowest number of tokens of the four.
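The four versions described above can be approximated with a short Python sketch. This is only an illustration under stated assumptions, not the author's actual pipeline: the stop-word list, the affix lists, the regular expressions, and the function names (`clean`, `remove_stop_words`, `light_stem`, `build_versions`) are all hypothetical, and the stemmer is a deliberately crude stand-in for whatever stemming technique was really used.

```python
import re

# Hypothetical, tiny stop-word list for illustration; the dataset's real list is not given.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}

# Assumed affix lists for a very rough "light" stemmer (not the dataset's stemmer).
PREFIXES = ("وال", "بال", "ال")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")

# Arabic diacritics (tashkeel), deleted outright so they do not split words.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Runs of anything that is not an Arabic letter (U+0621-U+064A) or whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]+")


def clean(text: str) -> list[str]:
    """Drop diacritics, replace symbols/digits/non-Arabic words with spaces, tokenize."""
    text = DIACRITICS.sub("", text)
    return NON_ARABIC.sub(" ", text).split()


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not in the (assumed) stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def light_stem(token: str) -> str:
    """Strip at most one prefix and one suffix, keeping at least three letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def build_versions(text: str) -> dict[str, list[str]]:
    """Produce token lists named after the four dataset versions."""
    cleaned = clean(text)
    no_stop = remove_stop_words(cleaned)
    return {
        "Original": text.split(),
        "Original_without_Stop": no_stop,
        "Original_with_Stem": [light_stem(t) for t in cleaned],
        "Original_Without_Stop_Stem": [light_stem(t) for t in no_stop],
    }


if __name__ == "__main__":
    sample = "فاز الفريق في المباراة 3-1 yesterday!"
    for name, tokens in build_versions(sample).items():
        print(name, len(tokens), tokens)
```

Comparing the token counts per version on a sample text shows the effect the description mentions: the raw version has the most tokens, and the cleaned versions have fewer.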

The data is divided into 10 different categories: Culture, Diverse, Economy, Sport, Politic, Art, Society, Technology, Medical and Religion.
G2 Dataset
brightdata.com
.json, .csv, .xlsx
Updated May 6, 2024
Cite
Bright Data (2024). G2 Dataset [Dataset]. https://brightdata.com/products/datasets/g2
Explore at:
.json, .csv, .xlsx (available download formats)
Dataset updated
May 6, 2024
Dataset authored and provided by
Bright Data
License
https://brightdata.com/license
Area covered
Worldwide
Description
Use our G2 dataset to collect product descriptions, ratings, reviews, and pricing information from the world's largest tech marketplace. You may purchase a full or partial dataset depending on your business needs.

The G2 Software Products Dataset, with a focus on top-rated products, serves as a valuable resource for software buyers, businesses, and technology enthusiasts. This use case highlights products that have received exceptional ratings and positive reviews on the G2 platform, offering insights into customer satisfaction and popularity. For software buyers, the dataset acts as a guide, presenting a curated selection of G2's top-rated software products and raising the likelihood of satisfaction with purchases. Businesses and technology professionals can use it to identify popular and well-reviewed software solutions and to inform their decision-making. This use case emphasizes the dataset's utility for those specifically interested in exploring and acquiring top-rated software products from G2.

Product Overview

The G2 software products and reviews dataset offers a detailed overview of leading software companies. The dataset includes all major data points:

Product descriptions
Average rating (1-5)
Sellers number of reviews
Key features (highest and lowest rated)
Competitors
Website & social media links
and more.

Popular Website Screenshots and Metadata

1,000 screenshots with detailed metadata from the world's most visited websites
