100+ datasets found
  1. Popular Website Screenshots and Metadata

    • kaggle.com
    zip
    Updated Jan 6, 2023
    + more versions
    Cite
    Christopher Pratt (2023). Popular Website Screenshots and Metadata [Dataset]. https://www.kaggle.com/datasets/christopherpratt/popular-website-screenshots-and-metadata
    Explore at:
    Available download formats: zip (1273641347 bytes)
    Dataset updated
    Jan 6, 2023
    Authors
    Christopher Pratt
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Silatus is sharing, for free, a segment of a dataset that we are using to train a generative AI model for text-to-mockup conversions. This dataset was collected in December 2022 and early January 2023, so it contains very recent data from 1,000 of the world's most popular websites. You can get our larger 10,000 website dataset for free at: https://silatus.com/datasets

    This dataset includes:

    High-res screenshots

    • 1024x1024px
    • Loaded JavaScript
    • Loaded Images

    Text metadata

    • Site title
    • Navbar content
    • Full page text data
    • Page description

    Visual metadata

    • Content (images, videos, inputs, buttons) absolute & relative positions
    • Color profile
    • Base font
  2. Popular Websites (edited) Dataset

    • universe.roboflow.com
    zip
    Updated Aug 2, 2025
    Cite
    Bro (2025). Popular Websites (edited) Dataset [Dataset]. https://universe.roboflow.com/bro-klhic/popular-websites-edited
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 2, 2025
    Dataset authored and provided by
    Bro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Popular Webpages Bounding Boxes
    Description

    Popular Websites (edited)

    ## Overview
    
    Popular Websites (edited) is a dataset for object detection tasks - it contains Popular Webpages annotations for 552 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  3. Popular websites across the globe

    • kaggle.com
    zip
    Updated May 27, 2017
    Cite
    bpali26 (2017). Popular websites across the globe [Dataset]. https://www.kaggle.com/bpali26/popular-websites-across-the-globe
    Explore at:
    Available download formats: zip (639485 bytes)
    Dataset updated
    May 27, 2017
    Authors
    bpali26
    Description

    Context

    This dataset includes basic information about websites we use daily. While scraping this data I learned a lot about R programming, system speed, memory usage, and so on, and developed my niche in web scraping. Scraping the data took about 4-5 hours on my system (4 GB RAM), and working out the idea for this project took about 4-5 days.

    Content

    The dataset contains the top 50 ranked sites from each of 191 countries along with their global traffic rank. Here, country_rank represents the traffic rank of the site within the country, and traffic_rank represents its global traffic rank.

    Since the meaning of most columns can be derived from their names, the dataset is fairly straightforward to understand. However, a few points can cause confusion, which I explain here:

    1) Most of the numeric values are stored as character strings and therefore contain spaces, which you may need to clean.

    2) The same website can appear multiple times. For example, Yahoo.com is present in 179 rows of this dataset, because it has a different country rank in each country.

    3) The information in this dataset covers the top 50 websites in 191 countries as of 25 May 2017 and is subject to change, since rankings are dynamic.

    4) The dataset actually contains 9540 rows instead of 9550 (50 * 191), because information was unavailable for 10 websites.

    PS: If there are any more queries, comment on this and I'll add an answer to the list above.
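
Points 1, 2 and 4 above can be illustrated with a short Python sketch. This is not the author's code: the sample rows and the `website` column name are hypothetical (only `country_rank` and `traffic_rank` are named in the description), and the cleaning rule simply removes all whitespace before converting to an integer.

```python
from collections import Counter

def to_int(raw: str) -> int:
    """Remove every whitespace character, then parse the digits."""
    return int("".join(raw.split()))

# Hypothetical sample rows; the real file is described as having 9540 rows.
rows = [
    {"website": "yahoo.com", "country_rank": " 4", "traffic_rank": "5 "},
    {"website": "yahoo.com", "country_rank": "7 ", "traffic_rank": " 5"},
    {"website": "example.org", "country_rank": "12", "traffic_rank": "1 024"},
]

# Point 1: convert the character-format numbers into integers.
cleaned = [
    {
        **row,
        "country_rank": to_int(row["country_rank"]),
        "traffic_rank": to_int(row["traffic_rank"]),
    }
    for row in rows
]

# Point 2: the same site appears once per country in which it ranks.
occurrences = Counter(row["website"] for row in cleaned)

# Point 4: expected versus reported row count.
expected_rows = 50 * 191
missing_rows = expected_rows - 9540
```

Counting occurrences per website before any analysis helps avoid treating the repeated rows as accidental duplicates.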

    Acknowledgements

    I couldn't have done this without the help of others. I scraped this information from publicly available (open to all) websites, namely 1) http://data.danetsoft.com/ and 2) http://www.alexa.com/topsites, for which I'm highly grateful. I truly appreciate and thank the owners of these sites for providing the information included in this dataset.

    Inspiration

    I feel there is a lot of scope for exploring and visualizing this dataset to find trends in the attributes of these websites across countries. One could also try predicting the global traffic rank as a variable dependent on the other attributes of a website. In any case, this dataset will help you find the popular sites in your area.

  4. Popular Websites (augmented + Nonedited) Dataset

    • universe.roboflow.com
    zip
    Updated Aug 2, 2025
    Cite
    Bro (2025). Popular Websites (augmented + Nonedited) Dataset [Dataset]. https://universe.roboflow.com/bro-klhic/popular-websites-augmented-nonedited/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 2, 2025
    Dataset authored and provided by
    Bro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Ui Elements Bounding Boxes
    Description

    Popular Websites (augmented + Nonedited)

    ## Overview
    
    Popular Websites (augmented + Nonedited) is a dataset for object detection tasks - it contains Ui Elements annotations for 602 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  5. Dataset Search WebApp

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Angelo Batista Neves Júnior; Luiz André Portes Paes Leme (2023). Dataset Search WebApp [Dataset]. http://doi.org/10.6084/m9.figshare.5217958.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Angelo Batista Neves Júnior; Luiz André Portes Paes Leme
    License

    https://www.gnu.org/copyleft/gpl.html

    Description

    Although extensive lists of open datasets are available in catalogues, most data publishers still connect their datasets to other popular datasets, such as DBpedia, Freebase and Geonames. Although linkage with popular datasets allows us to explore external resources, it fails to cover highly specialized information. Catalogues of linked data describe the content of datasets in terms of update periodicity, authors, SPARQL endpoints, linksets with other datasets, and so on, as recommended by the W3C VoID Vocabulary. However, catalogues by themselves do not provide any explicit information to help the URI linkage process. Searching techniques can rank the available datasets S_i according to the probability that links can be defined between URIs of S_i and a given dataset T to be published, so that most of the links, if not all, can be found by inspecting the most relevant datasets in the ranking. dataset-search is a tool for searching datasets for linkage.

  6. Dataset used for HTTPS traffic classification using packet burst statistics

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Apr 11, 2022
    Cite
    Tropkova Zdena; Hynek Karel; Cejka Tomas (2022). Dataset used for HTTPS traffic classification using packet burst statistics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4911550
    Explore at:
    Dataset updated
    Apr 11, 2022
    Dataset provided by
    CESNET: http://www.cesnet.cz/
    FIT CTU
    Authors
    Tropkova Zdena; Hynek Karel; Cejka Tomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We are publishing a dataset we created for the HTTPS traffic classification.

    Since the data were captured mainly in a real backbone network, we omitted IP addresses and ports. The datasets consist of statistics calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
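
To make the idea of a packet-burst sequence concrete, here is a minimal Python sketch. It assumes that a burst is a run of consecutive packets travelling in the same direction with at most `max_gap` seconds between neighbours. That definition, the `max_gap` default, and the tuple layout are illustrative assumptions, not ipfixprobe's actual export format or parameters.

```python
from typing import Dict, List, Tuple

# (timestamp in seconds, direction: +1 client->server / -1 server->client, length in bytes)
Packet = Tuple[float, int, int]

def split_into_bursts(packets: List[Packet], max_gap: float = 1.0) -> List[List[Packet]]:
    """Group time-ordered packets into bursts (assumed definition, see lead-in)."""
    bursts: List[List[Packet]] = []
    for pkt in packets:
        if bursts:
            last = bursts[-1][-1]
            same_direction = last[1] == pkt[1]
            close_in_time = pkt[0] - last[0] <= max_gap
            if same_direction and close_in_time:
                bursts[-1].append(pkt)
                continue
        # Direction changed, the gap was too long, or this is the first packet.
        bursts.append([pkt])
    return bursts

def burst_stats(burst: List[Packet]) -> Dict[str, float]:
    """Summarise one burst: packet count, total bytes, and duration."""
    return {
        "packets": len(burst),
        "bytes": sum(p[2] for p in burst),
        "duration": burst[-1][0] - burst[0][0],
    }

# Hypothetical flow: two uplink packets, two downlink packets, then a late one.
packets = [(0.0, 1, 100), (0.1, 1, 200), (0.2, -1, 1500), (0.3, -1, 1500), (2.0, -1, 400)]
bursts = split_into_bursts(packets)
stats = [burst_stats(b) for b in bursts]
```

Per-burst summaries like these are the kind of feature a classifier could use without ever seeing IP addresses or ports.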

    During our research, we divided HTTPS traffic into five categories: L (Live Video Streaming), P (Video Player), M (Music Player), U/D (File Upload/Download), and W (Website and other traffic).

    We have chosen the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:

    Live Video Stream: Twitch, Czech TV, YouTube Live

    Video Player: DailyMotion, Stream.cz, Vimeo, YouTube

    Music Player: AppleMusic, Spotify, SoundCloud

    File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive

    Website and Other Traffic: Websites from Alexa Top 1M list

  7. Corporate Website — Analytics — Popular pages

    • data.brisbane.qld.gov.au
    csv, excel, json
    Updated Feb 20, 2026
    + more versions
    Cite
    (2026). Corporate Website — Analytics — Popular pages [Dataset]. https://data.brisbane.qld.gov.au/explore/dataset/corporate-website-analytics-popular-pages/
    Explore at:
    Available download formats: json, excel, csv
    Dataset updated
    Feb 20, 2026
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Monthly analytics reports for the Brisbane City Council website

    Information on sessions for the Brisbane City Council website during the month, including page views and unique page views.

  8. Popularity Dataset for Online Stats Training

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Aug 25, 2020
    Cite
    Rens van de Schoot (2020). Popularity Dataset for Online Stats Training [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3962122
    Explore at:
    Dataset updated
    Aug 25, 2020
    Dataset authored and provided by
    Rens van de Schoot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset used for the online stats training website (https://www.rensvandeschoot.com/tutorials/) and is based on the data used by van de Schoot, van der Velden, Boom, and Brugman (2010).

    The dataset is based on a study that investigates the association between popularity status and antisocial behavior in at-risk adolescents (n = 1491), with gender and ethnic background as moderators of that association. The study distinguished subgroups within the popular status group in terms of overt and covert antisocial behavior. For more information on the sample, instruments, methodology, and research context, we refer interested readers to van de Schoot, van der Velden, Boom, and Brugman (2010).

    Variable name Description

    Respnr = Respondents’ number

    Dutch = Respondents’ ethnic background (0 = Dutch origin, 1 = non-Dutch origin)

    gender = Respondents’ gender (0 = boys, 1 = girls)

    sd = Adolescents’ socially desirable answering patterns

    covert = Covert antisocial behavior

    overt = Overt antisocial behavior

  9. Netflix

    • ieee-dataport.org
    Updated Oct 1, 2021
    Cite
    Danil Shamsimukhametov (2021). Netflix [Dataset]. https://ieee-dataport.org/documents/youtube-netflix-web-dataset-encrypted-traffic-classification
    Explore at:
    Dataset updated
    Oct 1, 2021
    Authors
    Danil Shamsimukhametov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    YouTube flows

  10. Data from: SANAD: Single-Label Arabic News Articles Dataset for Automatic...

    • data.mendeley.com
    Updated Sep 2, 2019
    + more versions
    Cite
    Omar Einea (2019). SANAD: Single-Label Arabic News Articles Dataset for Automatic Text Categorization [Dataset]. http://doi.org/10.17632/57zpx667y9.2
    Explore at:
    Dataset updated
    Sep 2, 2019
    Authors
    Omar Einea
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SANAD Dataset is a large collection of Arabic news articles that can be used in different Arabic NLP tasks such as Text Classification and Word Embedding. The articles were collected using Python scripts written specifically for three popular news websites: AlKhaleej, AlArabiya and Akhbarona.

    All datasets have seven categories [Culture, Finance, Medical, Politics, Religion, Sports and Tech], except AlArabiya which doesn’t have [Religion]. SANAD contains a total number of 190k+ articles.

    How to use it:

    1. Unzip compressed resources.
    2. Each folder contains 6-7 sub-folders which are labeled by the category's name.
    3. Each sub-folder contains a set of article files corresponding to its category.

    SANAD_SUBSET is a balanced benchmark dataset (from SANAD) that is used in our research work. It contains the training (90%) and testing (10%) sets.

    How to use it:

    1. Unzip the compressed file.
    2. There are 3 main folders containing the 3 datasets: Akhbarona, Khaleej, and Arabiya.
    3. Each dataset-folder contains 2 sub-folders: training and testing.
    4. The training and testing folders include the balanced categories sub-folders.
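
The folder layout described above, in which each sub-folder name is a category label, can be read with a short loop. The following is a minimal Python sketch, not the authors' loader: it assumes the article files are UTF-8 text, and the demo folder and file names are hypothetical stand-ins for the unzipped data.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple

def iter_labeled_articles(dataset_dir: Path) -> Iterator[Tuple[str, str]]:
    """Yield (category, text) pairs; each sub-folder name is used as the label."""
    for category_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        for article in sorted(p for p in category_dir.iterdir() if p.is_file()):
            # Assumption: article files are plain UTF-8 text.
            yield category_dir.name, article.read_text(encoding="utf-8")

# Demo on a throwaway directory that mimics the category sub-folder layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    samples = [
        ("Sports", "a.txt", "sample sports text"),
        ("Tech", "b.txt", "sample tech text"),
    ]
    for category, name, text in samples:
        (root / category).mkdir(exist_ok=True)
        (root / category / name).write_text(text, encoding="utf-8")
    pairs = list(iter_labeled_articles(root))
```

For SANAD_SUBSET, the same function could be pointed at a dataset's `training` or `testing` folder, since those contain the category sub-folders.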
  11. National Center for Education Statistics Common Core of Data

    • datalumos.org
    Updated Mar 4, 2025
    + more versions
    Cite
    United States Department of Education. Institute of Education Sciences. National Center for Education Statistics (2025). National Center for Education Statistics Common Core of Data [Dataset]. http://doi.org/10.3886/E221563V1
    Explore at:
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    National Center for Education Statistics: https://nces.ed.gov/
    Institute of Education Sciences: http://ies.ed.gov/
    United States Department of Education: https://ed.gov/
    Authors
    United States Department of Education. Institute of Education Sciences. National Center for Education Statistics
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Includes data files and supplemental information. Supplemental information includes a reproducible RMarkdown file, an Excel sheet with metadata, and complete webpage files. Please note that CCD nonfiscal documentation files have been downloaded manually.

    From the Common Core of Data website:

    The Common Core of Data (CCD) is the Department of Education's primary database on public elementary and secondary education in the United States. CCD is a comprehensive, annual, national database of all public elementary and secondary schools and school districts.

    Information on the Common Core of Data (CCD)

    The primary purpose of the CCD is to provide basic information on public elementary and secondary schools, local education agencies (LEAs), and state education agencies (SEAs) for each state, the District of Columbia, and the outlying territories with a U.S. relationship. CCD is composed of two components: Nonfiscal CCD and Fiscal CCD.

  12. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jun 19, 2024
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    Bright Data
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  13. UI/UX user interaction dataset across popular digital platforms

    • data.mendeley.com
    Updated Nov 19, 2024
    Cite
    Md Atikur Rahman (2024). UI/UX user interaction dataset across popular digital platforms [Dataset]. http://doi.org/10.17632/dxthxmnkhx.6
    Explore at:
    Dataset updated
    Nov 19, 2024
    Authors
    Md Atikur Rahman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises 2,271 entries and provides insights into user interface (UI) and user experience (UX) preferences across various digital platforms. Key information includes user demographics (Name, Age, Gender) and platform preferences (e.g., Twitter, YouTube, Facebook, Website). It captures user experiences and satisfaction levels with various UI/UX elements such as color schemes, visual hierarchy, typography, multimedia usage, and layout design. The dataset also includes evaluations of mobile responsiveness, call-to-action buttons, form usability, feedback/error messages, loading speed, personalization, accessibility, and interactions (like scrolling behavior and gestures). Each UI/UX component is rated on a scale, allowing for quantitative analysis of user preferences and experiences, making this dataset valuable for research in user-centered design and usability optimization.

  14. ClaimsKG - A Knowledge Graph of Fact-Checked Claims (January, 2023)

    • search.gesis.org
    • datacatalogue.cessda.eu
    Updated Jan 20, 2026
    + more versions
    Cite
    Gangopadhyay, Susmita; Schellhammer, Sebastian; Boland, Katarina; Schüller, Sascha; Todorov, Konstantin; Tchechmedjiev, Andon; Zapilko, Benjamin; Fafalios, Pavlos; Jabeen, Hajira; Dietze, Stefan (2026). ClaimsKG - A Knowledge Graph of Fact-Checked Claims (January, 2023) [Dataset]. http://doi.org/10.7802/2620
    Explore at:
    Dataset updated
    Jan 20, 2026
    Dataset provided by
    GESIS, Köln
    GESIS search
    Authors
    Gangopadhyay, Susmita; Schellhammer, Sebastian; Boland, Katarina; Schüller, Sascha; Todorov, Konstantin; Tchechmedjiev, Andon; Zapilko, Benjamin; Fafalios, Pavlos; Jabeen, Hajira; Dietze, Stefan
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Description

    ClaimsKG is a knowledge graph of metadata information for fact-checked claims scraped from popular fact-checking sites. In addition to providing a single dataset of claims and associated metadata, truth ratings are harmonized and additional information is provided for each claim, e.g., about mentioned entities. Please see (https://data.gesis.org/claimskg/) for further details about the data model, query examples and statistics.

    The dataset facilitates structured queries about claims, their truth values, involved entities, authors, dates, and other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline, which harvests claim-related data from popular fact-checking web sites, annotates them with related entities from DBpedia/Wikipedia, and lifts all data to RDF using established vocabularies (such as schema.org).

    The latest release of ClaimsKG covers 74066 claims and 72127 Claim Reviews. This is the fourth release of the dataset; data was scraped up to Jan 31, 2023 and contains claims published between 1996 and 2023 from 13 fact-checking websites. The websites are Fullfact, Politifact, TruthOrFiction, Checkyourfact, Vishvanews, AFP (French), AFP, Polygraph, EU factcheck, Factograph, Fatabyyano, Snopes and Africacheck. The claim-review (fact-checking) period for claims ranges from 1996 to 2023. As in the previous release, the Entity fishing python client (https://github.com/hirmeos/entity-fishing-client-python) was used for entity linking and disambiguation. Improvements have been made in the web scraping and data preprocessing pipeline to extract more entities from both claims and claim reviews. Currently, ClaimsKG contains 3408386 entities detected and referenced with DBpedia.

    This latest release of ClaimsKG supersedes the previous versions: it contains all the claims from the previous versions plus new claims, as well as improved entity annotation resulting in a higher number of entities.

  15. Which social media platforms are most popular

    • pewresearch.org
    csv
    Updated Feb 2, 2026
    Cite
    Pew Research Center (2026). Which social media platforms are most popular [Dataset]. https://www.pewresearch.org/internet/fact-sheet/social-media/
    Explore at:
    Available download formats: csv
    Dataset updated
    Feb 2, 2026
    Dataset authored and provided by
    Pew Research Center: http://pewresearch.org/
    License

    https://www.pewresearch.org/terms-and-conditions/

    Description

    A line chart that shows % of U.S. adults who say they ever use …

  16. VRBO Dataset

    • rebrowser.net
    csv, json
    Updated Mar 18, 2026
    Cite
    Rebrowser (2026). VRBO Dataset [Dataset]. https://rebrowser.net/products/datasets/vrbo
    Explore at:
    Available download formats: csv, json
    Dataset updated
    Mar 18, 2026
    Dataset authored and provided by
    Rebrowser
    License

    https://rebrowser.com/pricing

    Description

    Access comprehensive datasets containing millions of VRBO vacation rental records with historical pricing, occupancy patterns, and owner performance data. Our curated datasets cover popular vacation destinations with detailed property specifications and guest feedback analytics. Accelerate your vacation rental market research with pre-processed property data eliminating the need for complex scraping infrastructure. Ideal for real estate analysts, academic researchers, and investment firms requiring large-scale vacation rental market intelligence.

  17. Web tracking data for 500 websites popular among Finnish web users

    • zenodo.org
    Updated Apr 18, 2020
    Cite
    John Bailey; Mikael Laakso; Linus Nyman (2020). Web tracking data for 500 websites popular among Finnish web users [Dataset]. http://doi.org/10.5281/zenodo.3543444
    Explore at:
    Dataset updated
    Apr 18, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    John Bailey; Mikael Laakso; Linus Nyman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes observations of trackers present on the top 500 pages popular among Finnish web users as per Alexa. The data collection was conducted using TrackerTracker in five separate requests for five subsets of 100 sites each between 19.8.2017 and 20.8.2017. The tool used a tracker database from March 24, 2017. More methodology details are described in the associated journal article https://doi.org/10.23978/inf.87841

  18. Data from: E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects...

    • zenodo.org
    bin, txt
    Updated May 20, 2025
    Cite
    Sergio Di Meglio; Valeria Pontillo; Coen De Roover; Luigi Libero Lucio Starace; Sergio Di Martino; Ruben Opdebeeck (2025). E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects [Dataset]. http://doi.org/10.5281/zenodo.14221860
    Explore at:
    Available download formats: txt, bin
    Dataset updated
    May 20, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Sergio Di Meglio; Valeria Pontillo; Coen De Roover; Luigi Libero Lucio Starace; Sergio Di Martino; Ruben Opdebeeck
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT
    End-to-End (E2E) testing is a comprehensive approach to validating the functionality of a software application by testing its entire workflow from the user’s perspective, ensuring that all integrated components work together as expected. It is crucial for ensuring the quality and reliability of applications, especially in the web domain, which is often bound by Service Level Agreements (SLAs). This testing involves two key activities: Graphical User Interface (GUI) testing, which simulates user interactions through browsers, and performance testing, which evaluates system workload handling. Despite its importance, E2E testing is often neglected, and the lack of reliable datasets for Web GUI and performance testing has slowed research progress. This paper addresses these limitations by constructing E2EGit, a comprehensive dataset cataloging non-trivial open-source web projects on GITHUB that adopt GUI or performance testing.
    The dataset construction process involved analyzing over 5k non-trivial web repositories based on popular programming languages (JAVA, JAVASCRIPT, TYPESCRIPT, PYTHON) to identify: 1) GUI tests based on popular browser automation frameworks (SELENIUM, PLAYWRIGHT, CYPRESS, PUPPETEER), 2) performance tests written with the most popular open-source tools (JMETER, LOCUST). After analysis, we identified 472 repositories using web GUI testing, with over 43,000 tests, and 84 repositories using performance testing, with 410 tests.


    DATASET DESCRIPTION
    The dataset is provided as an SQLite database, whose structure is illustrated in Figure 3 (in the paper), which consists of five tables, each serving a specific purpose.
    The repository table contains information on 1.5 million repositories collected using the SEART tool on May 4. It includes 34 fields detailing repository characteristics. The non_trivial_repository table is a subset of the previous one, listing repositories that passed the two filtering stages described in the pipeline. For each repository, it specifies whether it is a web repository using JAVA, JAVASCRIPT, TYPESCRIPT, or PYTHON frameworks. A repository may use multiple frameworks, with corresponding fields (e.g., is web java) set to true, and the field web dependencies listing the detected web frameworks.
    For Web GUI testing, the dataset includes two additional tables: gui_testing_test_details, where each row represents a test file, providing the file path, the browser automation framework used, the test engine employed, and the number of tests implemented in the file; and gui_testing_repo_details, which aggregates data from the previous table at the repository level. Each of the 472 repositories has a row summarizing the number of test files using frameworks like SELENIUM or PLAYWRIGHT, test engines like JUNIT, and the total number of tests identified.
    For performance testing, the performance_testing_test_details table contains 410 rows, one for each test identified. Each row includes the file path, whether the test uses JMETER or LOCUST, and extracted details such as the number of thread groups, concurrent users, and requests. Notably, some fields may be absent, for instance if external files (e.g., CSVs defining workloads) were unavailable, or in the case of Locust tests, where parameters like duration and concurrent users are specified via the command line.
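
Because the dataset ships as an SQLite database, a first look at it needs nothing beyond Python's standard `sqlite3` module. The sketch below lists every table with its row count. It runs against an in-memory stand-in: the table name comes from the description, but the two column names and the sample rows are hypothetical, since the real schema is only documented in the paper.

```python
import sqlite3
from typing import Dict

def table_row_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Return {table name: row count} for every table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Table names come from sqlite_master itself, so quoting them is sufficient here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }

# In-memory stand-in; column names below are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE performance_testing_test_details (file_path TEXT, tool TEXT)")
conn.executemany(
    "INSERT INTO performance_testing_test_details VALUES (?, ?)",
    [("tests/load.jmx", "JMETER"), ("locustfile.py", "LOCUST")],
)
counts = table_row_counts(conn)
conn.close()
```

Against the published file, replacing `":memory:"` with the path to the downloaded database should report the five tables described above, with 410 rows expected for performance_testing_test_details.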

    To cite this article refer to this citation:

    @inproceedings{di2025e2egit,
    title={E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects},
    author={Di Meglio, Sergio and Starace, Luigi Libero Lucio and Pontillo, Valeria and Opdebeeck, Ruben and De Roover, Coen and Di Martino, Sergio},
    booktitle={2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR)},
    pages={10--15},
    year={2025},
    organization={IEEE/ACM}
    }

    This work has been partially supported by the Italian PNRR MUR project PE0000013-FAIR.

  19. Ultimate Arabic News Dataset

    • data.mendeley.com
    • opendatalab.com
    Updated Jul 4, 2022
    Ahmed Hashim Al-Dulaimi (2022). Ultimate Arabic News Dataset [Dataset]. http://doi.org/10.17632/jz56k5wxz7.2
    Dataset updated
    Jul 4, 2022
    Authors
    Ahmed Hashim Al-Dulaimi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Ultimate Arabic News Dataset is a collection of single-label modern Arabic texts that are used in news websites and press articles.

    The Arabic news data was collected by web scraping from many well-known news sites such as Al-Arabiya and Al-Youm Al-Sabea (Youm7), from news published on the Google search engine, and from various other sources.

    • The collected data consists of two primary files:

    UltimateArabic: A file containing more than 193,000 original Arabic news texts, without pre-processing. The texts contain words, numbers, and symbols that can be removed using pre-processing to increase accuracy when using the dataset in various Arabic natural language processing tasks such as text classification.

    UltimateArabicPrePros: A file containing the data from the first file after pre-processing, which leaves about 188,000 text documents. Stop words, non-Arabic words, symbols and numbers have been removed, so this file is ready for direct use in various Arabic natural language processing tasks such as text classification.

    • We have added two folders containing additional detailed datasets:

    1- Sample: This folder contains samples of the web-scraping results for two popular Arab websites in two different news categories, Sports and Politics. It contains two datasets:

    Sample_Youm7_Politic: An example of news in the "Politic" category collected from the Youm7 website.

    Sample_alarabiya_Sport: An example of news in the "Sport" category collected from the Al-Arabiya website.

    2- Dataset Versions: This folder contains four different versions of the original dataset, from which the appropriate version can be selected for use in text classification techniques. The first version (Original) contains the raw data without any pre-processing, so its number of tokens is very high. In the second version (Original_without_Stop), the data was cleaned by removing symbols, numbers, non-Arabic words and stop words, so the number of tokens is greatly reduced. In the third version (Original_with_Stem), the data was cleaned and text stemming was applied to strip the affixes that might affect the accuracy of the results and to obtain the word roots. In the fourth version (Original_Without_Stop_Stem), all pre-processing techniques (data cleaning, stop word removal and text stemming) were applied, so its number of tokens is the lowest of all versions.

    • The data is divided into 10 different categories: Culture, Diverse, Economy, Sport, Politic, Art, Society, Technology, Medical and Religion.
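
    The cleaning described for UltimateArabicPrePros (removing stop words, non-Arabic words, symbols and numbers) might look roughly like the sketch below. This is not the authors' pipeline: the function name, the three-word stop list and the sample sentence are illustrative assumptions, and a real stop-word list would be much longer.

```python
import re

# Runs of Arabic letters (U+0621..U+064A). Latin words, digits, punctuation
# and other symbols never match, so they are dropped. Note that diacritics
# (U+064B and above) are outside this range, so vocalized text would be
# split at each diacritic; this sketch assumes unvocalized news text.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")

# Tiny illustrative stop-word list (assumption, not the dataset's list).
STOP_WORDS = {"في", "من", "على"}


def preprocess(text, stop_words=STOP_WORDS):
    """Keep only Arabic-letter tokens and drop the given stop words."""
    tokens = ARABIC_WORD.findall(text)
    return " ".join(token for token in tokens if token not in stop_words)


sample = "فاز الفريق في المباراة 3-1 (final)"
print(preprocess(sample))
```

    Stemming, which the third and fourth dataset versions apply, is deliberately left out of this sketch.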
  20. G2 Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated May 6, 2024
    Bright Data (2024). G2 Dataset [Dataset]. https://brightdata.com/products/datasets/g2
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    May 6, 2024
    Dataset authored and provided by
    Bright Data
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Use our G2 dataset to collect product descriptions, ratings, reviews, and pricing information from the world's largest tech marketplace. You may purchase a full or partial dataset depending on your business needs.

    The G2 Software Products Dataset, with a focus on top-rated products, serves as a valuable resource for software buyers, businesses, and technology enthusiasts. This use case highlights products that have received exceptional ratings and positive reviews on the G2 platform, offering insights into customer satisfaction and popularity. For software buyers, this dataset acts as a trusted guide, presenting a curated selection of G2's top-rated software products, ensuring a higher likelihood of satisfaction with purchases. Businesses and technology professionals can leverage this dataset to identify popular and well-reviewed software solutions, optimizing their decision-making process. This use case emphasizes the dataset's utility for those specifically interested in exploring and acquiring top-rated software products from G2.

    Product Overview

    The G2 software products and reviews dataset offers a detailed and thorough overview of leading software companies. The dataset includes all major data points:

    • Product descriptions
    • Average rating (1-5)
    • Sellers number of reviews
    • Key features (highest and lowest rated)
    • Competitors
    • Website & social media links

    and more.

Christopher Pratt (2023). Popular Website Screenshots and Metadata [Dataset]. https://www.kaggle.com/datasets/christopherpratt/popular-website-screenshots-and-metadata

Popular Website Screenshots and Metadata

1,000 screenshots with detailed metadata from the world's most visited websites

Available download formats: zip (1273641347 bytes)
Dataset updated
Jan 6, 2023
Authors
Christopher Pratt
License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically

Description

Silatus is sharing, for free, a segment of a dataset that we are using to train a generative AI model for text-to-mockup conversions. This dataset was collected in December 2022 and early January 2023, so it contains very recent data from 1,000 of the world's most popular websites. You can get our larger 10,000 website dataset for free at: https://silatus.com/datasets

This dataset includes:

High-res screenshots

  • 1024x1024px
  • Loaded Javascript
  • Loaded Images

Text metadata

  • Site title
  • Navbar content
  • Full page text data
  • Page description

Visual metadata

  • Content (images, videos, inputs, buttons) absolute & relative positions
  • Color profile
  • Base font