Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This Dataset is a collection of HTML files that include examples of Phishing websites and Non-Phishing Websites and can be used to build Classification models on the website content. I created this dataset as a part of my Practicum project for my Masters in Cybersecurity from Georgia Tech.
Cover Photo Source: Photo by Clive Kim from Pexels: https://www.pexels.com/photo/fishing-sea-dawn-landscape-5887837/
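As a rough illustration of the content-based classification use described above, the sketch below trains a bag-of-words classifier on page text extracted from the HTML files. The phishing/ and legitimate/ folder names and the train/test split are assumptions made for illustration, not the actual layout of this archive.

```python
# A minimal sketch (not the author's pipeline): train a TF-IDF + logistic
# regression classifier on the visible text of each HTML page. The folder
# names "phishing" and "legitimate" are assumptions about how the files
# might be organized locally.
from pathlib import Path

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def load_pages(folder, label):
    texts, labels = [], []
    for path in Path(folder).glob("*.html"):
        html = path.read_text(errors="ignore")
        # Use the visible text of the page as the classification input.
        texts.append(BeautifulSoup(html, "html.parser").get_text(" ", strip=True))
        labels.append(label)
    return texts, labels


phish_x, phish_y = load_pages("phishing", 1)
legit_x, legit_y = load_pages("legitimate", 0)
X, y = phish_x + legit_x, phish_y + legit_y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(TfidfVectorizer(max_features=20000), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```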
This dataset contains a collection of Facebook Ads-related queries and responses generated by an AI assistant.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2 rows and is filtered where the book is Beginning Web programming with HTML, XHTML, and CSS. It features 5 columns, including author, publication date, book publisher, and BNB id.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Web Accessibility Improvement: The "Web Page Object Detection" model can be used to identify and label various elements on a web page, making it easier for people with visual impairments to navigate and interact with websites using screen readers and other assistive technologies.
Web Design Analysis: The model can be employed to analyze the structure and layout of popular websites, helping web designers understand best practices and trends in web design. This information can inform the creation of new, user-friendly websites or redesigns of existing pages.
Automatic Web Page Summary Generation: By identifying and extracting key elements, such as titles, headings, content blocks, and lists, the model can assist in generating concise summaries of web pages, which can aid users in their search for relevant information.
Web Page Conversion and Optimization: The model can be used to detect redundant or unnecessary elements on a web page and suggest their removal or modification, leading to cleaner designs and faster-loading pages. This can improve user experience and, potentially, search engine rankings.
Assisting Web Developers in Debugging and Testing: By detecting web page elements, the model can help identify inconsistencies or errors in a site's code or design, such as missing or misaligned elements, allowing developers to quickly diagnose and address these issues.
Dataset Name:
Web File Structure Dataset
Description:
This dataset is designed to train AI models on best practices for organizing files in web development projects. It includes 100,000 examples that cover the structure and conventions of HTML, CSS, JavaScript, and other web-related files. Each example consists of a prompt and a corresponding completion, providing comprehensive guidance on how to organize web project files effectively.
Key Features:… See the full description on the dataset page: https://huggingface.co/datasets/Juliankrg/Web_FileStructure_DataSet_100k.
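As a minimal sketch of how such a prompt/completion dataset can be loaded, the snippet below pulls the repository named on the dataset page with the Hugging Face datasets library; the split name and the exact column names ("prompt", "completion") are assumptions based on the description above.

```python
# A minimal sketch: load the dataset from the Hugging Face Hub and inspect
# one prompt/completion pair. The split and column names are assumptions
# based on the description, not confirmed field names.
from datasets import load_dataset

ds = load_dataset("Juliankrg/Web_FileStructure_DataSet_100k", split="train")
example = ds[0]
print(example.get("prompt"))
print(example.get("completion"))
```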
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Detect_web_element is a dataset for object detection tasks - it contains Content annotations for 1,206 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of “10 blue links” has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most-searched queries by year to extract a representative sample of SERPs from the Internet Archive, which has been keeping snapshots and the respective HTML versions of webpages over time; its collection contains more than 50 billion webpages.

We used Python and Selenium WebDriver for browser automation to visit each capture online, check whether the capture is valid, save the HTML version, and generate a full screenshot. The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a files folder. For file naming, we concatenate the initial of the search engine (G) with the capture's timestamp; the filename ends with a sequential integer "-N" if the timestamp is repeated. For example, "G20070330145203-1" identifies a second capture from Google on March 30, 2007, while the first is identified by "G20070330145203".

Using this dataset, we analyzed how SERPs evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We registered the appearance of SERP features and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERPs has been rising over the years, demanding a more extensive interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here. This graphic represents the diversity of captures by year and search engine (Google and Bing).
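For readers who want to reproduce a similar capture workflow, the following is a minimal sketch (not the authors' exact script) of visiting a Wayback Machine capture with Selenium, saving its rendered HTML, and taking a full-page screenshot; the capture URL and output names are illustrative, reusing the G20070330145203 example from the description.

```python
# A minimal sketch of the capture workflow described above: open a Wayback
# Machine capture, save its rendered HTML, and write a screenshot. The
# example URL and output names are illustrative only.
from selenium import webdriver

capture_url = "https://web.archive.org/web/20070330145203/http://www.google.com/"
out_name = "G20070330145203"

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
try:
    driver.get(capture_url)
    # Persist the HTML version of the capture.
    with open(f"{out_name}.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    # Stretch the window to the full page height before the screenshot.
    height = driver.execute_script("return document.body.scrollHeight")
    driver.set_window_size(1366, height)
    driver.save_screenshot(f"{out_name}.png")
finally:
    driver.quit()
```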
The CalFish Abundance Database contains a comprehensive collection of anadromous fisheries abundance information. Beginning in 1998, the Pacific States Marine Fisheries Commission, the California Department of Fish and Game, and the National Marine Fisheries Service began a cooperative project aimed at collecting, archiving, and entering into standardized electronic formats the wealth of information generated by fisheries resource management agencies and tribes throughout California.

Extensive data are currently available for chinook, coho, and steelhead. Major data categories include adult abundance population estimates, actual fish and/or carcass counts, counts of fish collected at dams, weirs, or traps, and redd counts. Harvest data have been compiled for many streams, and hatchery return data have been compiled for the state's mitigation facilities. A draft format has been developed for juvenile abundance and awaits final approval.

This CalFish Abundance Database shapefile was generated from fully routed 1:100,000 hydrography. In a few cases streams had to be added to the hydrography dataset in order to provide a means to create shapefiles representing the abundance data associated with them. Streams added were digitized at no more than 1:24,000 scale based on stream line images portrayed in 1:24,000 Digital Raster Graphics (DRG). These features generally represent abundance counts resulting from stream surveys. The linear features in this layer typically represent the location to which abundance data records apply: the reach or length of stream surveyed, or the stream sections for which a given population estimate applies. In some cases the actual stream section surveyed was not specified and linear features represent the entire stream. In many cases there are multiple datasets associated with the same length of stream, so linear features overlap. Please view the associated datasets for detail regarding specific features. In CalFish these are accessed through the "link" that is visible when performing an identify or query operation. A URL string is provided with each feature in the downloadable data, which can also be used to access the underlying datasets.

The coho data available via the CalFish website are linked directly to the StreamNet website, where the database's tabular data are currently stored. Additional information about StreamNet may be downloaded at http://www.streamnet.org. Complete documentation for the StreamNet database may be accessed at http://www.streamnet.org/def.html
The problem is to predict user ratings for web pages (within a subject category). The HTML source of each web page is given. Users looked at 50-100 pages per domain and rated each on a 3-point scale (hot, medium, cold).
This database contains HTML source of web pages plus the ratings of a single user on these web pages. Web pages are on four separate subjects (Bands- recording artists; Goats; Sheep; and BioMedical).
Data originally from the UCI ML Repository. Donated by:
Michael Pazzani, Department of Information and Computer Science, University of California, Irvine, Irvine, CA 92697-3425, pazzani@ics.uci.edu
Concept based Information Access with Google for Personalized Information Retrieval
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The Klarna Product Page Dataset is a dataset of publicly available pages corresponding to products sold online on various e-commerce websites. The dataset contains offline snapshots of 51,701 product pages collected from 8,175 distinct merchants across 8 different markets (US, GB, SE, NL, FI, NO, DE, AT) between 2018 and 2019. On each page, analysts labelled 5 elements of interest: the price of the product, its image, its name and the add-to-cart and go-to-cart buttons (if found). These labels are present in the HTML code as an attribute called klarna-ai-label taking one of the values: Price, Name, Main picture, Add to cart and Cart.
The snapshots are available in 3 formats: as MHTML files (~24GB), as WebTraversalLibrary (WTL) snapshots (~7.4GB), and as screenshots (~8.9GB). The MHTML format is less lossy: a browser can render these pages, though any JavaScript on the page is lost. The WTL snapshots are produced by loading the MHTML pages into a Chromium-based browser. To keep the WTL dataset compact, the screenshots of the rendered MHTML are provided separately; here we provide the HTML of the rendered DOM tree and additional page and element metadata with rendering information (bounding boxes of elements, font sizes, etc.). The folder structure of the screenshot dataset is identical to that of the WTL dataset and can be used to complete the WTL snapshots with image information. For convenience, the datasets are provided with a train/test split in which no merchants in the test set are present in the training set.
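A minimal sketch of how the labels could be read back from a snapshot, assuming one works with the HTML of the rendered DOM tree: it collects the elements carrying the klarna-ai-label attribute with BeautifulSoup. The file name is illustrative and this is not the dataset's official tooling.

```python
# A minimal sketch: collect the labelled elements of one snapshot by looking
# up the klarna-ai-label attribute in the rendered-DOM HTML. The file name
# "snapshot.html" is illustrative.
from bs4 import BeautifulSoup

with open("snapshot.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Each element of interest carries klarna-ai-label with one of the values
# Price, Name, Main picture, Add to cart, or Cart.
labelled = {}
for el in soup.find_all(attrs={"klarna-ai-label": True}):
    labelled[el["klarna-ai-label"]] = el

for label, el in labelled.items():
    print(label, "->", el.name, (el.get_text(strip=True) or "")[:60])
```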
Corresponding Publication
For more information about the contents of the datasets (statistics etc.) please refer to the following TMLR paper.
GitHub Repository
The code needed to re-run the experiments in the publication accompanying the dataset can be accessed here.
Citing
If you found this dataset useful in your research, please cite the paper as follows:
@article{hotti2024the,
  title={The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models},
  author={Alexandra Hotti and Riccardo Sven Risuleo and Stefan Magureanu and Aref Moradi and Jens Lagergren},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=zz6FesdDbB},
  note={}
}
WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no.
U.S. Department of Veterans Affairs Freedom of Information Act Service Webpage with many links to associated information.
The State Cancer Profiles (SCP) web site provides statistics to help guide and prioritize cancer control activities at the state and local levels. SCP is a collaborative effort using local and national level cancer data from the Centers for Disease Control and Prevention's National Program of Cancer Registries (NPCR) and the National Cancer Institute's Surveillance, Epidemiology and End Results Registries (SEER). SCP addresses select types of cancer and select behavioral risk factors for which there are evidence-based control interventions. The site provides incidence, mortality and prevalence comparison tables as well as interactive graphs and maps and support data. The graphs and maps provide visual support for deciding where to focus cancer control efforts.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset originally created 03/01/2019
UPDATE: Packaged on 04/18/2019
UPDATE: Edited README on 04/18/2019
I. About this Data Set This data set is a snapshot of work that is ongoing as a collaboration between the Kluge Fellow in Digital Studies, Patrick Egan, and an intern at the Library of Congress in the American Folklife Center. It contains a combination of metadata from various collections that contain audio recordings of Irish traditional music. The development of this dataset is iterative, and it integrates visualizations that follow the key principles of trust and approachability. The project, entitled “Connections In Sound,” invites you to use and re-use this data.
The text available in the Items dataset is generated from multiple collections of audio material that were discovered at the American Folklife Center. Each instance of a performance was listed and “sets” or medleys of tunes or songs were split into distinct instances in order to allow machines to read each title separately (whilst still noting that they were part of a group of tunes). The work of the intern was then reviewed before publication, and cross-referenced with the tune index at www.irishtune.info. The Items dataset consists of just over 1000 rows, with new data being added daily in a separate file.
The collections dataset contains at least 37 rows of collections that were located by a reference librarian at the American Folklife Center. This search was complemented by searches of the collections by the scholar both on the internet at https://catalog.loc.gov and by using card catalogs.
Updates to these datasets will be announced and published as the project progresses.
II. What’s included? This data set includes:
III. How Was It Created? These data were created by a Kluge Fellow in Digital Studies and an intern on this program over the course of three months. By listening, transcribing, reviewing, and tagging audio recordings, these scholars improve access and connect sounds in the American Folklife Collections by focusing on Irish traditional music. Once transcribed and tagged, information in these datasets is reviewed before publication.
IV. Data Set Field Descriptions
a) Collections dataset field descriptions
b) Items dataset field descriptions
V. Rights statement The text in this data set was created by the researcher and intern and can be used in many different ways under Creative Commons with attribution. All contributions to Connections In Sound are released into the public domain as they are created. Anyone is free to use and re-use this data set in any way they want, provided reference is given to the creators of these datasets.
VI. Creator and Contributor Information
Creator: Connections In Sound
Contributors: Library of Congress Labs
VII. Contact Information Please direct all questions and comments to Patrick Egan via www.twitter.com/drpatrickegan or via his website at www.patrickegan.org. You can also get in touch with the Library of Congress Labs team via LC-Labs@loc.gov.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5-8 participants and lasts between 45 and 60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior using the DeceptionRank algorithm.
The networks are weighted, directed and temporal. Each node represents a participant. At each 1/3 second, a directed edge from node u to v is weighted by the probability of participant u looking at participant v or the laptop. Additionally, we also provide a binary version where an edge from u to v indicates participant u looks at participant v (or the laptop).
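As an illustration of this structure, the sketch below represents a single 1/3-second frame as a weighted, directed graph with networkx; the edge-list layout (frame, source, target, gaze probability) is an assumption made for the example, not the dataset's actual file format.

```python
# A minimal sketch of how one snapshot of these networks could be
# represented: a weighted, directed graph with one node per participant.
# The edge list below is illustrative, not data from the dataset.
import networkx as nx

# (frame, source participant, target participant, gaze probability)
edges = [
    (0, "P1", "P2", 0.7),
    (0, "P1", "laptop", 0.3),
    (0, "P2", "P1", 0.9),
]

frame0 = nx.DiGraph()
for frame, u, v, w in edges:
    if frame == 0:
        frame0.add_edge(u, v, weight=w)

# Total attention received by each participant in this frame.
for node in frame0.nodes:
    incoming = sum(d["weight"] for _, _, d in frame0.in_edges(node, data=True))
    print(node, incoming)
```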
Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses a general-purpose, STL (Standard Template Library)-like library, GLib, developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.
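For orientation, here is a minimal sketch using the Snap.py Python bindings (pip install snap-stanford): it generates a random directed graph and reads a couple of basic properties. The calls shown are common Snap.py usage, but treat them as illustrative and consult the SNAP documentation for the current API.

```python
# A minimal sketch with the Snap.py bindings: generate a random directed
# graph, read basic properties, and modify the graph dynamically.
import snap

# Random directed graph with 1000 nodes and 5000 edges.
G = snap.GenRndGnm(snap.PNGraph, 1000, 5000)
print("nodes:", G.GetNodes(), "edges:", G.GetEdges())

# Nodes and edges can also be added dynamically during the computation.
G.AddNode(1000)
G.AddEdge(1000, 0)
print("after update:", G.GetNodes(), G.GetEdges())
```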
The "Phishing Data" dataset is a comprehensive collection of information specifically curated for analyzing and understanding phishing attacks. Phishing attacks involve malicious attempts to deceive individuals or organizations into disclosing sensitive information such as passwords or credit card details. This dataset comprises 18 distinct features that offer valuable insights into the characteristics of phishing attempts. These features include the URL of the website being analyzed, the length of the URL, the use of URL shortening services, the presence of the "@" symbol, the presence of redirection using "//", the presence of prefixes or suffixes in the URL, the number of subdomains, the usage of secure connection protocols (HTTPS), the length of time since domain registration, the presence of a favicon, the presence of HTTP or HTTPS tokens in the domain name, the URL of requested external resources, the presence of anchors in the URL, the number of hyperlinks in HTML tags, the server form handler used, the submission of data to email addresses, abnormal URL patterns, and estimated website traffic or popularity. Together, these features enable the analysis and detection of phishing attempts in the "Phishing Data" dataset, aiding in the development of models and algorithms to combat phishing attacks.
Product Page is a large-scale and realistic dataset of webpages. The dataset contains 51,701 manually labeled product pages from 8,175 real e-commerce websites. The pages can be rendered entirely in a web browser and are suitable for computer vision applications. This makes it substantially richer and more diverse than other datasets proposed for element representation learning, classification and prediction on the web.
This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (October 2021) based on the Common Crawl October 2021 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

Dataset Properties

Size: 0.2 GiB compressed, 4.4 GiB uncompressed, 20 361 829 rows plus 1 head line (determined using gunzip -c measurements.csv.gz | wc -l).

Parsing Failures: The scanner failed to parse 45 833 332 triples (~0.1 %) of the source dataset (containing 38 812 275 607 triples).

Content:

CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.

FILE_URL: The URL of the Web Data Commons file that has been measured.

MEASUREMENT: The applied measurement with specific conditions, one of:
- UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
- UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
- UsedAsDatatype: The total number of literals with the datatype.
- UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
- ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
- ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
- ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
- ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
- ValidInfOrNaNNotation: The number of lexicals that equal either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
- ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
- ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
- ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
- ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, xsd:integer, xsd:decimal, xsd:float, and xsd:double.

Note: The lexical representation of xsd:double values in embedded JSON-LD got normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.

PROPERTY: The property that has been measured.

DATATYPE: The datatype that has been measured.

QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.
Preview "CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY" "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#longitude","https://www.w3.org/2001/XMLSchema#float","1" "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#latitude","https://www.w3.org/2001/XMLSchema#float","1" "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://purl.org/goodrelations/v1#hasCurrencyValue","https://www.w3.org/2001/XMLSchema#float","6" … "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-embedded-jsonld.nq-06239.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","96" "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-embedded-jsonld.nq-06239.gz","ValidZeroOrOneNotation","http://schema.org/minValue","http://www.w3.org/2001/XMLSchema#integer","164" "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-embedded-jsonld.nq-06239.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","361" Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of probably "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions ...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity Matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets including both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of the experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

Dataset Description: An augmented version of the WDC phones dataset for benchmarking entity matching/record linkage methods, found at: http://webdatacommons.org/productcorpus/index.html#toc4 The augmented version adds fixed splits for training, validation and testing, as well as their corresponding feature vectors. The feature vectors are built using data-type-specific similarity metrics. The dataset contains 447 records describing products deriving from 17 e-shops, which are matched against a product catalog of 50 products. The gold standards have manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, while the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
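As an illustration of feature vectors built from data-type-specific similarity metrics, the sketch below computes one string feature and one numeric feature for a candidate record pair; the metrics and the example records are illustrative, not the ones used by the benchmark.

```python
# An illustrative sketch of data-type-specific similarity features for a
# record pair, in the spirit of the feature vectors described above. The
# example records and metrics are not taken from the benchmark itself.
def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased word tokens (a string-attribute metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def numeric_similarity(a: float, b: float) -> float:
    """Relative closeness of two numbers (a numeric-attribute metric)."""
    return 1.0 - abs(a - b) / max(abs(a), abs(b), 1e-9)


record = {"title": "Nokia Lumia 520 8GB black", "price": 99.0}
catalog = {"title": "Nokia Lumia 520 (8 GB, black)", "price": 95.0}

features = [
    token_jaccard(record["title"], catalog["title"]),        # string attribute
    numeric_similarity(record["price"], catalog["price"]),   # numeric attribute
]
print(features)  # one feature vector for this candidate pair
```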