55 datasets found

Website Statistics
data.wu.ac.at
data.europa.eu
csv, pdf
Updated Jun 11, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lincolnshire County Council (2018). Website Statistics [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/M2ZkZDBjOTUtMzNhYi00YWRjLWI1OWMtZmUzMzA5NjM0ZTdk
Explore at:
csv, pdfAvailable download formats
Dataset updated
Jun 11, 2018
Dataset provided by
Lincolnshire County Councilhttp://www.lincolnshire.gov.uk/
License
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Description
This Website Statistics dataset has four resources showing usage of the Lincolnshire Open Data website. Web analytics terms used in each resource are defined in their accompanying Metadata file.

Website Usage Statistics: This document shows a statistical summary of usage of the Lincolnshire Open Data site for the latest calendar year.

Website Statistics Summary: This dataset shows a website statistics summary for the Lincolnshire Open Data site for the latest calendar year.

Webpage Statistics: This dataset shows statistics for individual Webpages on the Lincolnshire Open Data site by calendar year.

Dataset Statistics: This dataset shows cumulative totals for Datasets on the Lincolnshire Open Data site that have also been published on the national Open Data site Data.Gov.UK - see the Source link.

Note: Website and Webpage statistics (the first three resources above) show only UK users, and exclude API calls (automated requests for datasets). The Dataset Statistics are confined to users with javascript enabled, which excludes web crawlers and API calls.

These Website Statistics resources are updated annually in January by the Lincolnshire County Council Business Intelligence team. For any enquiries about the information contact opendata@lincolnshire.gov.uk.
P
CoVA Dataset
paperswithcode.com
Updated Oct 23, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anurendra Kumar; Keval Morabia; Jingjin Wang; Kevin Chen-Chuan Chang; Alexander Schwing (2021). CoVA Dataset [Dataset]. https://paperswithcode.com/dataset/cova
Explore at:
Dataset updated
Oct 23, 2021
Authors
Anurendra Kumar; Keval Morabia; Jingjin Wang; Kevin Chen-Chuan Chang; Alexander Schwing
Description
We labeled 7,740 webpage screenshots spanning 408 domains (Amazon, Walmart, Target, etc.). Each of these webpages contains exactly one labeled price, title, and image. All other web elements are labeled as background. On average, there are 90 web elements in a webpage.

Webpage screenshots and bounding boxes can be obtained here

Train-Val-Test split We create a cross-domain split which ensures that each of the train, val and test sets contains webpages from different domains. Specifically, we construct a 3 : 1 : 1 split based on the number of distinct domains. We observed that the top-5 domains (based on number of samples) were Amazon, EBay, Walmart, Etsy, and Target. So, we created 5 different splits for 5-Fold Cross Validation such that each of the major domains is present in one of the 5 splits for test data.
Phishing Websites Detection
kaggle.com
Updated May 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
J Akshaya (2020). Phishing Websites Detection [Dataset]. https://www.kaggle.com/akshaya1508/phishing-websites-detection/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 28, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
J Akshaya
Description
Context

Phishing is a form of identity theft that occurs when a malicious website impersonates a legitimate one in order to acquire sensitive information such as passwords, account details, or credit card numbers. People generally tend to fall pray to this very easily. Kudos to the commendable craftsmanship of the attackers which makes people believe that it is a legitimate website. There is a need to identify the potential phishing websites and differentiate them from the legitimate ones. This dataset identifies the prominent features of the phishing websites, 10 such features have been identified.

Content

Generally, the open source datasets available on the internet do not comes with the code and the logic which arises certain problems i.e.:

Limited Data: The ML algorithms can only be tested with the existing phishing URLs and no new phishing URLS can be checked for its validity.

Outdated URLs: The datasets available on the internet has been uploaded long time ago, there are new kind of phishing URLs arising in every second.

Outdated Features: The datasets available on the internet has been uploaded long time ago, there are new methodologies arising in phishing techniques.

No Access to Backend: There is no stepwise guide describing how the feature has been derived.

On the contrary we are trying to overcome all the above-mentioned problems.

1. Real Time Data: Before applying a Machine Learning algorithm, we can run the script and fetch real time URLs from Phishtank (for phishing URLs) and from moz (for legitimate URLs) 2. Scalable Data: We can also specify the number of URLs we want to feed the model and hence the web scrapper will fetch that much amount of data from the websites. Presently we are using 1401 URLs in this project i.e. 901 Phishing URLs and 500 Legitimate URLS. 3. New Features: We have tried to implement the prominent new features that is there in the current phishing URLs and since we own the code, new features can also be added. 4. Source code on Github: The source code is published on GitHub for public use and can be used for further scope of improvements. This way there will be transparency to the logic and more creators can add there meaningful additions to the code.

Link to the source code

https://github.com/akshaya1508/detection_of_phishing_websites.git

Inspiration

The idea to develop the dataset and the code for this dataset has been inspired by various other creators who have worked on the similar lines.
Korean language Web pages dataset
figshare.com
txt
Updated Jan 29, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lulwah Alkwai (2017). Korean language Web pages dataset [Dataset]. http://doi.org/10.6084/m9.figshare.4588735.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4588735.v1
Dataset updated
Jan 29, 2017
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Lulwah Alkwai
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains 1,517 URIs with content determined to be in the Korean language. The URIs were collected from DMOZ. All 1,517 URIs were available on the live Web as of December 2015.

This data is used and further described in the journal article: Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2017. Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages. ACM Transactions on Information Systems (TOIS).

This work was an extension of the paper: Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle. 2015. How Well Are Arabic Websites Archived?. In Proceedings of the 15th IEEE/ACM Joint Conference on Digital Libraries (JCDL). ACM
P
Alexa Domains Dataset
paperswithcode.com
opendatalab.com
Updated Feb 1, 2001
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Isaac Corley; Jonathan Lwowski; Justin Hoffman (2001). Alexa Domains Dataset [Dataset]. https://paperswithcode.com/dataset/gagan-bhatia
Explore at:
Dataset updated
Feb 1, 2001
Authors
Isaac Corley; Jonathan Lwowski; Justin Hoffman
Description
This dataset is composed of the URLs of the top 1 million websites. The domains are ranked using the Alexa traffic ranking which is determined using a combination of the browsing behavior of users on the website, the number of unique visitors, and the number of pageviews. In more detail, unique visitors are the number of unique users who visit a website on a given day, and pageviews are the total number of user URL requests for the website. However, multiple requests for the same website on the same day are counted as a single pageview. The website with the highest combination of unique visitors and pageviews is ranked the highest
Z
The Klarna Product-Page Dataset
data.niaid.nih.gov
researchdata.se
+1more
Updated Nov 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Moradi, Aref (2024). The Klarna Product-Page Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12605479
Explore at:
Dataset updated
Nov 7, 2024
Dataset provided by
Magureanu, Stefan
Moradi, Aref
Lagergren, Jens
Hotti, Alexandra
Risuleo, Riccardo Sven
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Description

The Klarna Product Page Dataset is a dataset of publicly available pages corresponding to products sold online on various e-commerce websites. The dataset contains offline snapshots of 51,701 product pages collected from 8,175 distinct merchants across 8 different markets (US, GB, SE, NL, FI, NO, DE, AT) between 2018 and 2019. On each page, analysts labelled 5 elements of interest: the price of the product, its image, its name and the add-to-cart and go-to-cart buttons (if found). These labels are present in the HTML code as an attribute called klarna-ai-label taking one of the values: Price, Name, Main picture, Add to cart and Cart.

The snapshots are available in 3 formats: as MHTML files (~24GB), as WebTraversalLibrary (WTL) snapshots (~7.4GB), and as screeshots (~8.9GB). The MHTML format is less lossy, a browser can render these pages though any Javascript on the page is lost. The WTL snapshots are produced by loading the MHTML pages into a chromium-based browser. To keep the WTL dataset compact, the screenshots of the rendered MTHML are provided separately; here we provide the HTML of the rendered DOM tree and additional page and element metadata with rendering information (bounding boxes of elements, font sizes etc.). The folder structure of the screenshot dataset is identical to the one the WTL dataset and can be used to complete the WTL snapshots with image information. For convenience, the datasets are provided with a train/test split in which no merchants in the test set are present in the training set.

Corresponding Publication

For more information about the contents of the datasets (statistics etc.) please refer to the following TMLR paper.

GitHub Repository

The code needed to re-run the experiments in the publication accompanying the dataset can be accessed here.

Citing

If you found this dataset useful in your research, please cite the paper as follows:

@article{hotti2024the, title={The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models}, author={Alexandra Hotti and Riccardo Sven Risuleo and Stefan Magureanu and Aref Moradi and Jens Lagergren}, journal={Transactions on Machine Learning Research}, issn={2835-8856}, year={2024}, url={https://openreview.net/forum?id=zz6FesdDbB}, note={} }
R
Webpage Detection Dataset
universe.roboflow.com
zip
Updated Mar 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
screen2code (2024). Webpage Detection Dataset [Dataset]. https://universe.roboflow.com/screen2code/webpage-detection/model/1
Explore at:
zipAvailable download formats
Dataset updated
Mar 5, 2024
Dataset authored and provided by
screen2code
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Webpage Bounding Boxes
Description
Webpage Detection

## Overview Webpage Detection is a dataset for object detection tasks - it contains Webpage annotations for 6,189 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
P
WebUI Dataset
paperswithcode.com
Updated Mar 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). WebUI Dataset [Dataset]. https://paperswithcode.com/dataset/webui
Explore at:
Dataset updated
Mar 15, 2025
Description
The WebUI dataset contains 400K web UIs captured over a period of 3 months and cost about $500 to crawl. We grouped web pages together by their domain name, then generated training (70%), validation (10%), and testing (20%) splits. This ensured that similar pages from the same website must appear in the same split. We created four versions of the training dataset. Three of these splits were generated by randomly sampling a subset of the training split: Web-7k, Web-70k, Web-350k. We chose 70k as a baseline size, since it is approximately the size of existing UI datasets. We also generated an additional split (Web-7k-Resampled) to provide a small, higher quality split for experimentation. Web-7k-Resampled was generated using a class-balancing sampling technique, and we removed screens with possible visual defects (e.g., very small, occluded, or invisible elements). The validation and test split was always kept the same.
Visitor analytics in city of Helsinki websites
kaggle.com
Updated Dec 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Olaf Laitinen (2024). Visitor analytics in city of Helsinki websites [Dataset]. http://doi.org/10.34740/kaggle/dsv/10342181
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/10342181
Dataset updated
Dec 31, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Olaf Laitinen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Helsinki
Description
Administrator: Helsingin kaupunginkanslia / Digitalisaatioyksikkö

Administrator's webpage: https://www.hel.fi/fi

Published: 10.03.2022

Updated: 02.09.2022

Update frequency: day

Categories: Local government

Tags: visitor counts

Geographical coverage: Helsinki

Time series starts: 2022-01-01

Time series accuracy: month

License: Creative Commons Attribution 4.0

How to reference: Source: Visitor analytics in city of Helsinki websites. The maintainer of the dataset is Helsingin kaupunginkanslia / Digitalisaatioyksikkö. The dataset has been downloaded from Helsinki Region Infoshare service on 31.12.2024 under the license Creative Commons Attribution 4.0.
Most visited websites by hierachycal categories
kaggle.com
Updated Sep 18, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Natanael de Souza Figueiredo (2020). Most visited websites by hierachycal categories [Dataset]. https://www.kaggle.com/natanael127/most-visited-websites-by-hierachycal-categories/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 18, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Natanael de Souza Figueiredo
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)

The categories list was going out by September, 17h, 2020. So I would like to save it. https://support.alexa.com/hc/en-us/articles/360051913314

This dataset was elaborated by this python script (V2.0): https://github.com/natanael127/dump-alexa-ranking

Content

The sites are grouped in 17 macro categories and this tree ends having more than 360.000 nodes. Subjects are very organized and each of them has its own rank of most accessed domains. So, even the keys of a sub-dictionary may be a good small dataset to use.

Acknowledgements

Thank you my friend André (https://github.com/andrerclaudio) by helping me with tips of Google Colaboratory and computational power to get the data until our deadline.

Inspiration

Alexa ranking was inspired by Library of Alexandria. In the modern world, it may be a good start for AI know more about many, many subjects of the world.
NIST Statistical Reference Datasets - SRD 140
catalog.data.gov
datasets.ai
+2more
Updated Jul 29, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2022). NIST Statistical Reference Datasets - SRD 140 [Dataset]. https://catalog.data.gov/dataset/nist-statistical-reference-datasets-srd-140-df30c
Explore at:
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software. Currently datasets and certified values are provided for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. The collection includes both generated and 'real-world' data of varying levels of difficulty. Generated datasets are designed to challenge specific computations. These include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression. Certified values are 'best-available' solutions. The certification procedure is described in the web pages for each statistical method. Datasets are ordered by level of difficulty (lower, average, and higher). Strictly speaking the level of difficulty of a dataset depends on the algorithm. These levels are merely provided as rough guidance for the user. Producing correct results on all datasets of higher difficulty does not imply that your software will pass all datasets of average or even lower difficulty. Similarly, producing correct results for all datasets in this collection does not imply that your software will do the same for your particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software. The Statistical Reference Datasets is also supported by the Standard Reference Data Program.
R
Web Page Object Detection Dataset
universe.roboflow.com
zip
Updated Mar 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
web page summarizer (2023). Web Page Object Detection Dataset [Dataset]. https://universe.roboflow.com/web-page-summarizer/web-page-object-detection
Explore at:
zipAvailable download formats
Dataset updated
Mar 2, 2023
Dataset authored and provided by
web page summarizer
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Web Page Elements Bounding Boxes
Description
Here are a few use cases for this project:

Web Accessibility Improvement: The "Web Page Object Detection" model can be used to identify and label various elements on a web page, making it easier for people with visual impairments to navigate and interact with websites using screen readers and other assistive technologies.

Web Design Analysis: The model can be employed to analyze the structure and layout of popular websites, helping web designers understand best practices and trends in web design. This information can inform the creation of new, user-friendly websites or redesigns of existing pages.

Automatic Web Page Summary Generation: By identifying and extracting key elements, such as titles, headings, content blocks, and lists, the model can assist in generating concise summaries of web pages, which can aid users in their search for relevant information.

Web Page Conversion and Optimization: The model can be used to detect redundant or unnecessary elements on a web page and suggest their removal or modification, leading to cleaner designs and faster-loading pages. This can improve user experience and, potentially, search engine rankings.

Assisting Web Developers in Debugging and Testing: By detecting web page elements, the model can help identify inconsistencies or errors in a site's code or design, such as missing or misaligned elements, allowing developers to quickly diagnose and address these issues.
d
Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets...
data.depositar.io
markdown, pdf, png
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
depositar (2025). Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets and Research Outlooks [Dataset]. https://data.depositar.io/dataset/recently-orphaned-newspapers
Explore at:
png(105119), png(583480), png(100000), markdown(13411), pdf(921808), pdf(3026940)Available download formats
Dataset updated
Apr 17, 2025
Dataset provided by
depositar
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Update: (2025-04-17) Also added to this dataset are the presentation slides and script used at the Web Archiving Conference 2025 (WAC2025) on 2025-04-10.

Collected in this dataset are the abstract and related materials prepared for a submission to The 2025 General Assembly (GA) and Web Archiving Conference (WAC) . The abstract has been accepted for a 15-minute presentation with a 5-minute Q&A at the conference which is to be held at the National Library of Norway in Oslo from 8-10 April 2025.

The full abstract (in PDF) and the figures (in PNG) are collected into this dataset. The text from the abstract is also copied below.

Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets and Research Outlooks

2024-09-17

Tyng-Ruey Chuang
Chia-Hsun Wang
Hung-Yen Wu

Topics:

IMPROVING DISCOVERY & ACCESS

Keywords:

Orphan Works

Online News

IPTC (International Press Telecommunications Council)

ninjs: News in JSON

FAIR Data

Traditional Chinese Language Resources

We report on our progress in converting the web archives of a recently orphaned newspaper into accessible article collections in IPTC (International Press Telecommunications Council) standard format for news representation. After the conversion, old articles extracted from a defunct news website are now reincarnated as research datasets meeting the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. Specifically, we focus on Taiwan's Apple Daily and work on the WARC files built by the Archive Team in September 2022 at a time when the future of the newspaper seemed dim. We convert these WARC files into de-duplicated collections of pure text in ninjs (News in JSON) format.

The Apple Daily in Taiwan had been in publication since 2003 but discontinued its print edition in May 2021. In August 2022, its online edition was no longer being updated, and the entire news website has become inaccessible since March 2023. The fate of Taiwan's Apple Daily followed that of its (elder) sister publication in Hong Kong. The Apple Daily in Hong Kong was forced to cease its entire operation after midnight June 23, 2021. Its pro-democracy founder, Jimmy Lai (黎智英), was arrested under Hong Kong's security law the year before.

Being orphaned and offline, past reports and commentaries from the newspapers on contemporary events (e.g. the Sunflower Movement in Taiwan and the Umbrella Movement in Hong Kong) become unavailable to the general public. Such inaccessibility has impacts on education (e.g. fewer news sources to be edited into Wikipedia), research (e.g. fewer materials to study the early 2000s zeitgeist in Hong Kong and Taiwan), and knowledge production (e.g. fewer traditional Chinese corpora to work with).

Our work in transforming the WARC records into ninjs objects produces a collection of unique 953,175 news articles totaling in 4.3 GB. The articles are grouped by the day/month/year they were published hence it is convenient to look into a specific date for the news that were published on that day. Metadata about each article — headline(s), subject(s), original URI, unique ID, among others — are mapped into the corresponding fields in the ninjs object for ready access.

Figure 1 shows the ninjs object derived from a news article that was published on 2014-03-19, archived on 2021-09-29, and converted by us on 2024-02-17. Figure 2 is a screenshot of the webpage where the news was originally published, as kept in the WayBack Machine of the Internet Archive. Figure 3 displays the text file of the ninjs object in Figure 1 (noted that character strings are expressed in JSON's escaped Unicode format). Currently the images and videos accompanying the news article have not been extracted. This is evident in that the video (playable in the WayBack Machine) shown in Figure 2 is missing in the ninjs object in Figure 1. Another process is in the plan to preserve and link to these media files in the produced ninjs object.

In our presentation, we shall elaborate on the technical details (such as the accuracy and coverage of the conversion) and exemplary use cases of the collection. We will touch on the roles of public research organizations in preserving and making available materials that are deemed out of commerce and circulation.
c
Images dataset from multiple sources
crawlfeeds.com
zip
Updated Jan 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Images dataset from multiple sources [Dataset]. https://crawlfeeds.com/datasets/images-dataset-from-multiple-sources
Explore at:
zipAvailable download formats
Dataset updated
Jan 17, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
Images files download from different sites like walmart, amazon, instacart, gopuff, target and kroger.

Dataset not included any schema

Images extracted from the different categories its included coffee, cups, beer, filters and cat food.

Total images count: 12K

Image formats: JPEG, JPG and PNG
Z
Analysing state-backed propaganda websites: a new dataset and linguistic...
data.niaid.nih.gov
Updated Dec 14, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Heppell, Freddy (2023). Analysing state-backed propaganda websites: a new dataset and linguistic study (public dataset) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10007382
Explore at:
Dataset updated
Dec 14, 2023
Dataset provided by
Heppell, Freddy
Bontcheva, Kalina
Scarton, Carolina
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset accompanying the EMNLP 2023 paper "Analysing state-backed propaganda websites: a new dataset and linguistic study".

For copyright and liability reasons, we do not publicly distribute the complete dataset. Instead, we provide the software used to create the dataset (DOI: 10.5281/zenodo.10008086) and a list containing the URLs of all the posts in the full dataset (this repository).

To reconstruct our dataset: use the software to extract the sites, then filter the posts to the corresponding URL list. Please note that some posts may no longer be available or may have been modified.

If you are researching disinformation, propaganda, or a relevant field: please contact the authors, we may be able to provide you with the original dataset.
P
Data from: WebText Dataset
paperswithcode.com
Updated Jul 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever (2022). WebText Dataset [Dataset]. https://paperswithcode.com/dataset/webtext
Explore at:
Dataset updated
Jul 10, 2022
Authors
Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever
Description
WebText is an internal OpenAI corpus created by scraping web pages with emphasis on document quality. The authors scraped all outbound links from Reddit which received at least 3 karma. The authors used the approach as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.
z
Requirements data sets (user stories)
zenodo.org
data.mendeley.com
txt
Updated Jan 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fabiano Dalpiaz; Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.17632/7zbk8zsd8y.1
Dataset updated
Jan 13, 2025
Dataset provided by
Mendeley Data
Authors
Fabiano Dalpiaz; Fabiano Dalpiaz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of 22 data set of 50+ requirements each, expressed as user stories.

The dataset has been created by gathering data from web sources and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies]

The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

Overview of the datasets [data and links added in December 2024]

The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

Public administration and transparency

g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertain to the website that is used to share publicly the spending data for the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains a system called DAIMS or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted in GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.

g03-loudoun.txt (2018) is a set of extracted requirements from a document, by the Loudoun County Virginia, that describes the to-be user stories and use cases about a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

g04-recycling.txt(2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset has obtained from a GitHub website and it is at the basis of a students' project on web site design; the code is available (no license).

g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.

g11-nsf.txt (2018) refers to a collection of user stories referring to the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

(Research) data and meta-data management

g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories has been collected in 2016 by GitHub user @danfowler and are stored in a Trello board.

g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

g16-mis.txt (2015) is a collection of user stories that pertains a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.

g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.

g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should part of a website that manages data management plans. Each user story is stored as an issue on the GitHub's page.

g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source, web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and
born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its
Website Screenshots Dataset
universe.roboflow.com
kaggle.com
zip
Updated Aug 19, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Roboflow (2022). Website Screenshots Dataset [Dataset]. https://universe.roboflow.com/roboflow-gw7yv/website-screenshots
Explore at:
zipAvailable download formats
Dataset updated
Aug 19, 2022
Dataset authored and provided by
Roboflowhttps://roboflow.com/
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Variables measured
Elements Bounding Boxes
Description
About This Dataset

The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1000 of the world's top websites. They have been automatically annotated to label the following classes: :fa-spacer: * button - navigation links, tabs, etc. * heading - text that was enclosed in <h1> to <h6> tags. * link - inline, textual <a> tags. * label - text labeling form fields. * text - all other text. * image - <img>, <svg>, or <video> tags, and icons. * iframe - ads and 3rd party content.

Example

This is an example image and annotation from the dataset: https://i.imgur.com/mOG3u3Z.png" alt="WIkipedia Screenshot">

Usage

Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.

Collecting Custom Data

Roboflow is happy to provide a custom screenshots dataset to meet your particular needs. We can crawl public or internal web applications. Just reach out and we'll be happy to provide a quote!

About Roboflow

Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless. :fa-spacer: Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility. :fa-spacer:
P
WebKB Dataset
paperswithcode.com
opendatalab.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mark Craven; Dan DiPasquo; Dayne Freitag; Andrew McCallum; Tom M. Mitchell; Kamal Nigam; Seán Slattery, WebKB Dataset [Dataset]. https://paperswithcode.com/dataset/webkb
Explore at:
Authors
Mark Craven; Dan DiPasquo; Dayne Freitag; Andrew McCallum; Tom M. Mitchell; Kamal Nigam; Seán Slattery
Description
WebKB is a dataset that includes web pages from computer science departments of various universities. 4,518 web pages are categorized into 6 imbalanced categories (Student, Faculty, Staff, Department, Course, Project). Additionally there is Other miscellanea category that is not comparable to the rest.
h
VisualWebBench
huggingface.co
Updated Apr 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
VisualWebBench (2024). VisualWebBench [Dataset]. https://huggingface.co/datasets/visualwebbench/VisualWebBench
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 5, 2024
Dataset authored and provided by
VisualWebBench
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
VisualWebBench

Dataset for the paper: VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? 🌐 Homepage | 🐍 GitHub | 📖 arXiv

Introduction

We introduce VisualWebBench, a multimodal benchmark designed to assess the understanding and grounding capabilities of MLLMs in web scenarios. VisualWebBench consists of seven tasks, and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains. We evaluate 14… See the full description on the dataset page: https://huggingface.co/datasets/visualwebbench/VisualWebBench.

Facebook

Twitter

Click to copy link

Link copied

Cite

Lincolnshire County Council (2018). Website Statistics [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/M2ZkZDBjOTUtMzNhYi00YWRjLWI1OWMtZmUzMzA5NjM0ZTdk

Website Statistics

Explore at:

csv, pdfAvailable download formats

Dataset updated

Jun 11, 2018

Dataset provided by

Lincolnshire County Councilhttp://www.lincolnshire.gov.uk/

License

Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically

Description

This Website Statistics dataset has four resources showing usage of the Lincolnshire Open Data website. Web analytics terms used in each resource are defined in their accompanying Metadata file.

Website Usage Statistics: This document shows a statistical summary of usage of the Lincolnshire Open Data site for the latest calendar year.
Website Statistics Summary: This dataset shows a website statistics summary for the Lincolnshire Open Data site for the latest calendar year.
Webpage Statistics: This dataset shows statistics for individual Webpages on the Lincolnshire Open Data site by calendar year.
Dataset Statistics: This dataset shows cumulative totals for Datasets on the Lincolnshire Open Data site that have also been published on the national Open Data site Data.Gov.UK - see the Source link.

Note: Website and Webpage statistics (the first three resources above) show only UK users, and exclude API calls (automated requests for datasets). The Dataset Statistics are confined to users with javascript enabled, which excludes web crawlers and API calls.

These Website Statistics resources are updated annually in January by the Lincolnshire County Council Business Intelligence team. For any enquiries about the information contact opendata@lincolnshire.gov.uk.

Clear search

Close search

Google apps

Main menu

Website Statistics

CoVA Dataset

Phishing Websites Detection

Context

Content

Link to the source code

Inspiration

Korean language Web pages dataset

Alexa Domains Dataset

The Klarna Product-Page Dataset

Webpage Detection Dataset

Webpage Detection

WebUI Dataset

Visitor analytics in city of Helsinki websites

Most visited websites by hierachycal categories

Context

Content

Acknowledgements

Inspiration

NIST Statistical Reference Datasets - SRD 140

Web Page Object Detection Dataset

Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets...

Recently Orphaned Newspapers: From Archived Webpages to Reusable Datasets and Research Outlooks

Images dataset from multiple sources

Analysing state-backed propaganda websites: a new dataset and linguistic...

Data from: WebText Dataset

Requirements data sets (user stories)

Overview of the datasets [data and links added in December 2024]

Public administration and transparency

(Research) data and meta-data management

Website Screenshots Dataset

About This Dataset

Example

Usage

Collecting Custom Data

About Roboflow

WebKB Dataset

VisualWebBench

Website Statistics