Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This Dataset is a collection of HTML files that include examples of Phishing websites and Non-Phishing Websites and can be used to build Classification models on the website content. I created this dataset as a part of my Practicum project for my Masters in Cybersecurity from Georgia Tech.
Cover Photo Source: Photo by Clive Kim from Pexels: https://www.pexels.com/photo/fishing-sea-dawn-landscape-5887837/
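As a rough illustration of the content-based classification use described above, the sketch below trains a bag-of-words classifier on page text extracted from the HTML files. The phishing/ and legitimate/ folder names and the train/test split are assumptions made for illustration, not the actual layout of this archive.

```python
# A minimal sketch (not the author's pipeline): train a TF-IDF + logistic
# regression classifier on the visible text of each HTML page. The folder
# names "phishing" and "legitimate" are assumptions about how the files
# might be organized locally.
from pathlib import Path

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def load_pages(folder, label):
    texts, labels = [], []
    for path in Path(folder).glob("*.html"):
        html = path.read_text(errors="ignore")
        # Use the visible text of the page as the classification input.
        texts.append(BeautifulSoup(html, "html.parser").get_text(" ", strip=True))
        labels.append(label)
    return texts, labels


phish_x, phish_y = load_pages("phishing", 1)
legit_x, legit_y = load_pages("legitimate", 0)
X, y = phish_x + legit_x, phish_y + legit_y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(TfidfVectorizer(max_features=20000), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```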
This dataset contains a collection of Facebook Ads-related queries and responses generated by an AI assistant.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 2 rows and is filtered where the book is Beginning Web programming with HTML, XHTML, and CSS. It features 5 columns, including author, publication date, book publisher, and BNB id.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Web Accessibility Improvement: The "Web Page Object Detection" model can be used to identify and label various elements on a web page, making it easier for people with visual impairments to navigate and interact with websites using screen readers and other assistive technologies.
Web Design Analysis: The model can be employed to analyze the structure and layout of popular websites, helping web designers understand best practices and trends in web design. This information can inform the creation of new, user-friendly websites or redesigns of existing pages.
Automatic Web Page Summary Generation: By identifying and extracting key elements, such as titles, headings, content blocks, and lists, the model can assist in generating concise summaries of web pages, which can aid users in their search for relevant information.
Web Page Conversion and Optimization: The model can be used to detect redundant or unnecessary elements on a web page and suggest their removal or modification, leading to cleaner designs and faster-loading pages. This can improve user experience and, potentially, search engine rankings.
Assisting Web Developers in Debugging and Testing: By detecting web page elements, the model can help identify inconsistencies or errors in a site's code or design, such as missing or misaligned elements, allowing developers to quickly diagnose and address these issues.
Dataset Name:
Web File Structure Dataset
Description:
This dataset is designed to train AI models on best practices for organizing files in web development projects. It includes 100,000 examples that cover the structure and conventions of HTML, CSS, JavaScript, and other web-related files. Each example consists of a prompt and a corresponding completion, providing comprehensive guidance on how to organize web project files effectively.
Key Features:… See the full description on the dataset page: https://huggingface.co/datasets/Juliankrg/Web_FileStructure_DataSet_100k.
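As a minimal sketch of how such a prompt/completion dataset can be loaded, the snippet below pulls the repository named on the dataset page with the Hugging Face datasets library; the split name and the exact column names ("prompt", "completion") are assumptions based on the description above.

```python
# A minimal sketch: load the dataset from the Hugging Face Hub and inspect
# one prompt/completion pair. The split and column names are assumptions
# based on the description, not confirmed field names.
from datasets import load_dataset

ds = load_dataset("Juliankrg/Web_FileStructure_DataSet_100k", split="train")
example = ds[0]
print(example.get("prompt"))
print(example.get("completion"))
```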
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Detect_web_element is a dataset for object detection tasks - it contains Content annotations for 1,206 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of “10 blue links” has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most-searched queries by year to extract a representative sample of SERPs from the Internet Archive, which has been keeping snapshots and the respective HTML versions of webpages over time; its collection contains more than 50 billion webpages.

We used Python and Selenium WebDriver for browser automation to visit each capture online, check whether the capture is valid, save the HTML version, and generate a full screenshot. The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a files folder. For file naming, we concatenate the initial of the search engine (G) with the capture's timestamp; the filename ends with a sequential integer "-N" if the timestamp is repeated. For example, "G20070330145203-1" identifies a second capture from Google on March 30, 2007, while the first is identified by "G20070330145203".

Using this dataset, we analyzed how SERPs evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We registered the appearance of SERP features and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERPs has been rising over the years, demanding a more extensive interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here. This graphic represents the diversity of captures by year and search engine (Google and Bing).
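For readers who want to reproduce a similar capture workflow, the following is a minimal sketch (not the authors' exact script) of visiting a Wayback Machine capture with Selenium, saving its rendered HTML, and taking a full-page screenshot; the capture URL and output names are illustrative, reusing the G20070330145203 example from the description.

```python
# A minimal sketch of the capture workflow described above: open a Wayback
# Machine capture, save its rendered HTML, and write a screenshot. The
# example URL and output names are illustrative only.
from selenium import webdriver

capture_url = "https://web.archive.org/web/20070330145203/http://www.google.com/"
out_name = "G20070330145203"

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
try:
    driver.get(capture_url)
    # Persist the HTML version of the capture.
    with open(f"{out_name}.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    # Stretch the window to the full page height before the screenshot.
    height = driver.execute_script("return document.body.scrollHeight")
    driver.set_window_size(1366, height)
    driver.save_screenshot(f"{out_name}.png")
finally:
    driver.quit()
```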
The CalFish Abundance Database contains a comprehensive collection of anadromous fisheries abundance information. Beginning in 1998, the Pacific States Marine Fisheries Commission, the California Department of Fish and Game, and the National Marine Fisheries Service began a cooperative project aimed at collecting, archiving, and entering into standardized electronic formats the wealth of information generated by fisheries resource management agencies and tribes throughout California.

Extensive data are currently available for chinook, coho, and steelhead. Major data categories include adult abundance population estimates, actual fish and/or carcass counts, counts of fish collected at dams, weirs, or traps, and redd counts. Harvest data have been compiled for many streams, and hatchery return data have been compiled for the state's mitigation facilities. A draft format has been developed for juvenile abundance and awaits final approval.

This CalFish Abundance Database shapefile was generated from fully routed 1:100,000 hydrography. In a few cases streams had to be added to the hydrography dataset in order to provide a means to create shapefiles representing the abundance data associated with them. Streams added were digitized at no more than 1:24,000 scale based on stream line images portrayed in 1:24,000 Digital Raster Graphics (DRG). These features generally represent abundance counts resulting from stream surveys. The linear features in this layer typically represent the location to which abundance data records apply: the reach or length of stream surveyed, or the stream sections for which a given population estimate applies. In some cases the actual stream section surveyed was not specified and linear features represent the entire stream. In many cases there are multiple datasets associated with the same length of stream, so linear features overlap. Please view the associated datasets for detail regarding specific features. In CalFish these are accessed through the "link" that is visible when performing an identify or query operation. A URL string is provided with each feature in the downloadable data, which can also be used to access the underlying datasets.

The coho data available via the CalFish website are linked directly to the StreamNet website, where the database's tabular data are currently stored. Additional information about StreamNet may be downloaded at http://www.streamnet.org. Complete documentation for the StreamNet database may be accessed at http://www.streamnet.org/def.html
The problem is to predict user ratings for web pages (within a subject category). The HTML source of each web page is given. Users looked at 50-100 pages per domain and rated each on a 3-point scale (hot, medium, cold).
This database contains HTML source of web pages plus the ratings of a single user on these web pages. Web pages are on four separate subjects (Bands- recording artists; Goats; Sheep; and BioMedical).
Data originally from the UCI ML Repository. Donated by:
Michael Pazzani, Department of Information and Computer Science, University of California, Irvine, Irvine, CA 92697-3425, pazzani@ics.uci.edu
Concept based Information Access with Google for Personalized Information Retrieval
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The Klarna Product Page Dataset is a dataset of publicly available pages corresponding to products sold online on various e-commerce websites. The dataset contains offline snapshots of 51,701 product pages collected from 8,175 distinct merchants across 8 different markets (US, GB, SE, NL, FI, NO, DE, AT) between 2018 and 2019. On each page, analysts labelled 5 elements of interest: the price of the product, its image, its name and the add-to-cart and go-to-cart buttons (if found). These labels are present in the HTML code as an attribute called klarna-ai-label taking one of the values: Price, Name, Main picture, Add to cart and Cart.
The snapshots are available in 3 formats: as MHTML files (~24GB), as WebTraversalLibrary (WTL) snapshots (~7.4GB), and as screenshots (~8.9GB). The MHTML format is less lossy: a browser can render these pages, though any JavaScript on the page is lost. The WTL snapshots are produced by loading the MHTML pages into a Chromium-based browser. To keep the WTL dataset compact, the screenshots of the rendered MHTML are provided separately; here we provide the HTML of the rendered DOM tree and additional page and element metadata with rendering information (bounding boxes of elements, font sizes, etc.). The folder structure of the screenshot dataset is identical to that of the WTL dataset and can be used to complete the WTL snapshots with image information. For convenience, the datasets are provided with a train/test split in which no merchants in the test set are present in the training set.
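A minimal sketch of how the labels could be read back from a snapshot, assuming one works with the HTML of the rendered DOM tree: it collects the elements carrying the klarna-ai-label attribute with BeautifulSoup. The file name is illustrative and this is not the dataset's official tooling.

```python
# A minimal sketch: collect the labelled elements of one snapshot by looking
# up the klarna-ai-label attribute in the rendered-DOM HTML. The file name
# "snapshot.html" is illustrative.
from bs4 import BeautifulSoup

with open("snapshot.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Each element of interest carries klarna-ai-label with one of the values
# Price, Name, Main picture, Add to cart, or Cart.
labelled = {}
for el in soup.find_all(attrs={"klarna-ai-label": True}):
    labelled[el["klarna-ai-label"]] = el

for label, el in labelled.items():
    print(label, "->", el.name, (el.get_text(strip=True) or "")[:60])
```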
Corresponding Publication
For more information about the contents of the datasets (statistics etc.) please refer to the following TMLR paper.
GitHub Repository
The code needed to re-run the experiments in the publication accompanying the dataset can be accessed here.
Citing
If you found this dataset useful in your research, please cite the paper as follows:
@article{hotti2024the,
  title={The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models},
  author={Alexandra Hotti and Riccardo Sven Risuleo and Stefan Magureanu and Aref Moradi and Jens Lagergren},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2024},
  url={https://openreview.net/forum?id=zz6FesdDbB},
  note={}
}
WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no.
U.S. Department of Veterans Affairs Freedom of Information Act Service Webpage with many links to associated information.
The State Cancer Profiles (SCP) web site provides statistics to help guide and prioritize cancer control activities at the state and local levels. SCP is a collaborative effort using local and national level cancer data from the Centers for Disease Control and Prevention's National Program of Cancer Registries (NPCR) and the National Cancer Institute's Surveillance, Epidemiology and End Results Registries (SEER). SCP addresses select types of cancer and select behavioral risk factors for which there are evidence-based control interventions. The site provides incidence, mortality and prevalence comparison tables as well as interactive graphs and maps and support data. The graphs and maps provide visual support for deciding where to focus cancer control efforts.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset originally created 03/01/2019
UPDATE: Packaged on 04/18/2019
UPDATE: Edited README on 04/18/2019
I. About this Data Set This data set is a snapshot of work that is ongoing as a collaboration between the Kluge Fellow in Digital Studies, Patrick Egan, and an intern at the Library of Congress in the American Folklife Center. It contains a combination of metadata from various collections that contain audio recordings of Irish traditional music. The development of this dataset is iterative, and it integrates visualizations that follow the key principles of trust and approachability. The project, entitled “Connections In Sound,” invites you to use and re-use this data.
The text available in the Items dataset is generated from multiple collections of audio material that were discovered at the American Folklife Center. Each instance of a performance was listed and “sets” or medleys of tunes or songs were split into distinct instances in order to allow machines to read each title separately (whilst still noting that they were part of a group of tunes). The work of the intern was then reviewed before publication, and cross-referenced with the tune index at www.irishtune.info. The Items dataset consists of just over 1000 rows, with new data being added daily in a separate file.
The collections dataset contains at least 37 rows of collections that were located by a reference librarian at the American Folklife Center. This search was complemented by searches of the collections by the scholar both on the internet at https://catalog.loc.gov and by using card catalogs.
Updates to these datasets will be announced and published as the project progresses.
II. What’s included? This data set includes:
III. How Was It Created? These data were created by a Kluge Fellow in Digital Studies and an intern on this program over the course of three months. By listening, transcribing, reviewing, and tagging audio recordings, these scholars improve access and connect sounds in the American Folklife Collections by focusing on Irish traditional music. Once transcribed and tagged, information in these datasets is reviewed before publication.
IV. Data Set Field Descriptions
a) Collections dataset field descriptions
b) Items dataset field descriptions
V. Rights statement The text in this data set was created by the researcher and intern and can be used in many different ways under Creative Commons with attribution. All contributions to Connections In Sound are released into the public domain as they are created. Anyone is free to use and re-use this data set in any way they want, provided reference is given to the creators of these datasets.
VI. Creator and Contributor Information
Creator: Connections In Sound
Contributors: Library of Congress Labs
VII. Contact Information Please direct all questions and comments to Patrick Egan via www.twitter.com/drpatrickegan or via his website at www.patrickegan.org. You can also get in touch with the Library of Congress Labs team via LC-Labs@loc.gov.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5-8 participants and lasts between 45 and 60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior using the DeceptionRank algorithm.
The networks are weighted, directed and temporal. Each node represents a participant. At each 1/3 second, a directed edge from node u to v is weighted by the probability of participant u looking at participant v or the laptop. Additionally, we also provide a binary version where an edge from u to v indicates participant u looks at participant v (or the laptop).
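As an illustration of this structure, the sketch below represents a single 1/3-second frame as a weighted, directed graph with networkx; the edge-list layout (frame, source, target, gaze probability) is an assumption made for the example, not the dataset's actual file format.

```python
# A minimal sketch of how one snapshot of these networks could be
# represented: a weighted, directed graph with one node per participant.
# The edge list below is illustrative, not data from the dataset.
import networkx as nx

# (frame, source participant, target participant, gaze probability)
edges = [
    (0, "P1", "P2", 0.7),
    (0, "P1", "laptop", 0.3),
    (0, "P2", "P1", 0.9),
]

frame0 = nx.DiGraph()
for frame, u, v, w in edges:
    if frame == 0:
        frame0.add_edge(u, v, weight=w)

# Total attention received by each participant in this frame.
for node in frame0.nodes:
    incoming = sum(d["weight"] for _, _, d in frame0.in_edges(node, data=True))
    print(node, incoming)
```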
Stanford Network Analysis Platform (SNAP) is a general purpose, high performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.
The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.
SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses a general-purpose, STL (Standard Template Library)-like library, GLib, developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.
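For orientation, here is a minimal sketch using the Snap.py Python bindings (pip install snap-stanford): it generates a random directed graph and reads a couple of basic properties. The calls shown are common Snap.py usage, but treat them as illustrative and consult the SNAP documentation for the current API.

```python
# A minimal sketch with the Snap.py bindings: generate a random directed
# graph, read basic properties, and modify the graph dynamically.
import snap

# Random directed graph with 1000 nodes and 5000 edges.
G = snap.GenRndGnm(snap.PNGraph, 1000, 5000)
print("nodes:", G.GetNodes(), "edges:", G.GetEdges())

# Nodes and edges can also be added dynamically during the computation.
G.AddNode(1000)
G.AddEdge(1000, 0)
print("after update:", G.GetNodes(), G.GetEdges())
```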
The "Phishing Data" dataset is a comprehensive collection of information specifically curated for analyzing and understanding phishing attacks. Phishing attacks involve malicious attempts to deceive individuals or organizations into disclosing sensitive information such as passwords or credit card details. This dataset comprises 18 distinct features that offer valuable insights into the characteristics of phishing attempts. These features include the URL of the website being analyzed, the length of the URL, the use of URL shortening services, the presence of the "@" symbol, the presence of redirection using "//", the presence of prefixes or suffixes in the URL, the number of subdomains, the usage of secure connection protocols (HTTPS), the length of time since domain registration, the presence of a favicon, the presence of HTTP or HTTPS tokens in the domain name, the URL of requested external resources, the presence of anchors in the URL, the number of hyperlinks in HTML tags, the server form handler used, the submission of data to email addresses, abnormal URL patterns, and estimated website traffic or popularity. Together, these features enable the analysis and detection of phishing attempts in the "Phishing Data" dataset, aiding in the development of models and algorithms to combat phishing attacks.
Product Page is a large-scale and realistic dataset of webpages. The dataset contains 51,701 manually labeled product pages from 8,175 real e-commerce websites. The pages can be rendered entirely in a web browser and are suitable for computer vision applications. This makes it substantially richer and more diverse than other datasets proposed for element representation learning, classification and prediction on the web.
This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (October 2021) based on the Common Crawl October 2021 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

Dataset Properties

Size: 0.2 GiB compressed, 4.4 GiB uncompressed, 20 361 829 rows plus 1 head line (determined using gunzip -c measurements.csv.gz | wc -l).

Parsing Failures: The scanner failed to parse 45 833 332 triples (~0.1 %) of the source dataset (containing 38 812 275 607 triples).

Content:

CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.

FILE_URL: The URL of the Web Data Commons file that has been measured.

MEASUREMENT: The applied measurement with specific conditions, one of:
- UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
- UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
- UsedAsDatatype: The total number of literals with the datatype.
- UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
- ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
- ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
- ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
- ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
- ValidInfOrNaNNotation: The number of lexicals that equal either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
- ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
- ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
- ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
- ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, xsd:integer, xsd:decimal, xsd:float, and xsd:double.

Note: The lexical representation of xsd:double values in embedded JSON-LD got normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.

PROPERTY: The property that has been measured.

DATATYPE: The datatype that has been measured.

QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.
Preview "CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY" "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#longitude","https://www.w3.org/2001/XMLSchema#float","1" "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#latitude","https://www.w3.org/2001/XMLSchema#float","1" "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://purl.org/goodrelations/v1#hasCurrencyValue","https://www.w3.org/2001/XMLSchema#float","6" … "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-embedded-jsonld.nq-06239.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","96" "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-embedded-jsonld.nq-06239.gz","ValidZeroOrOneNotation","http://schema.org/minValue","http://www.w3.org/2001/XMLSchema#integer","164" "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-embedded-jsonld.nq-06239.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","361" Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of probably "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions ...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation: Entity Matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets including both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of the experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

Dataset Description: An augmented version of the WDC phones dataset for benchmarking entity matching/record linkage methods, found at: http://webdatacommons.org/productcorpus/index.html#toc4 The augmented version adds fixed splits for training, validation and testing, as well as their corresponding feature vectors. The feature vectors are built using data-type-specific similarity metrics. The dataset contains 447 records describing products deriving from 17 e-shops, which are matched against a product catalog of 50 products. The gold standards have manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, while the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
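As an illustration of feature vectors built from data-type-specific similarity metrics, the sketch below computes one string feature and one numeric feature for a candidate record pair; the metrics and the example records are illustrative, not the ones used by the benchmark.

```python
# An illustrative sketch of data-type-specific similarity features for a
# record pair, in the spirit of the feature vectors described above. The
# example records and metrics are not taken from the benchmark itself.
def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased word tokens (a string-attribute metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def numeric_similarity(a: float, b: float) -> float:
    """Relative closeness of two numbers (a numeric-attribute metric)."""
    return 1.0 - abs(a - b) / max(abs(a), abs(b), 1e-9)


record = {"title": "Nokia Lumia 520 8GB black", "price": 99.0}
catalog = {"title": "Nokia Lumia 520 (8 GB, black)", "price": 95.0}

features = [
    token_jaccard(record["title"], catalog["title"]),        # string attribute
    numeric_similarity(record["price"], catalog["price"]),   # numeric attribute
]
print(features)  # one feature vector for this candidate pair
```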