54 datasets found
  1. Phishing Website HTML Classification

    • kaggle.com
    Updated Apr 14, 2022
    Cite
    Hunter Kempf (2022). Phishing Website HTML Classification [Dataset]. https://www.kaggle.com/datasets/huntingdata11/phishing-website-html-classification
    Explore at:
    Croissant: a format for machine-learning datasets (learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hunter Kempf
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset is a collection of HTML files containing examples of phishing and non-phishing websites, and can be used to build classification models on website content. I created it as part of my practicum project for my Master's in Cybersecurity at Georgia Tech.

    Cover Photo Source: Photo by Clive Kim from Pexels: https://www.pexels.com/photo/fishing-sea-dawn-landscape-5887837/
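    Descriptions like this one point toward content-based classification; a minimal stdlib sketch of HTML feature extraction is below (the feature names and the choice of features are my own illustration, not part of the dataset):

    ```python
    from html.parser import HTMLParser


    class PhishFeatureExtractor(HTMLParser):
        """Collect simple content features often used in phishing classifiers."""

        def __init__(self):
            super().__init__()
            self.features = {"forms": 0, "password_inputs": 0,
                             "external_scripts": 0, "links": 0}

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "form":
                self.features["forms"] += 1
            elif tag == "input" and attrs.get("type") == "password":
                self.features["password_inputs"] += 1
            elif tag == "script" and (attrs.get("src") or "").startswith("http"):
                self.features["external_scripts"] += 1
            elif tag == "a":
                self.features["links"] += 1


    def extract_features(html: str) -> dict:
        """Feed one HTML document through the parser and return its feature counts."""
        parser = PhishFeatureExtractor()
        parser.feed(html)
        return parser.features
    ```

    Feature vectors produced this way could then be fed to any off-the-shelf classifier.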

  2. HTML files

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +3more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). HTML files [Dataset]. https://catalog.data.gov/dataset/html-files
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    This is a resource where HTML files are stored for the website.

  3. HTML-CSS-Website

    • huggingface.co
    Updated Mar 23, 2024
    + more versions
    Cite
    M Asad Iqbal (2024). HTML-CSS-Website [Dataset]. https://huggingface.co/datasets/MAsad789565/HTML-CSS-Website
    Explore at:
    Croissant
    Dataset updated
    Mar 23, 2024
    Authors
    M Asad Iqbal
    Description

    Dataset

    This dataset contains a collection of FacebookAds-related queries and responses generated by an AI assistant.

    Proudly generated with an AI Dataset Generator API

  4. Dataset of author, BNB id, book publisher, and publication date of Beginning...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of author, BNB id, book publisher, and publication date of Beginning Web programming with HTML, XHTML, and CSS [Dataset]. https://www.workwithdata.com/datasets/books?col=author%2Cbnb_id%2Cbook%2Cbook%2Cbook_publisher%2Cpublication_date&f=1&fcol0=book&fop0=%3D&fval0=Beginning+Web+programming+with+HTML%2C+XHTML%2C+and+CSS
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 2 rows and is filtered where the book is Beginning Web programming with HTML, XHTML, and CSS. It features 5 columns, including author, publication date, book publisher, and BNB id.

  5. Web Page Object Detection Dataset

    • universe.roboflow.com
    zip
    Updated Mar 2, 2023
    Cite
    web page summarizer (2023). Web Page Object Detection Dataset [Dataset]. https://universe.roboflow.com/web-page-summarizer/web-page-object-detection
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 2, 2023
    Dataset authored and provided by
    web page summarizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Web Page Elements Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Web Accessibility Improvement: The "Web Page Object Detection" model can be used to identify and label various elements on a web page, making it easier for people with visual impairments to navigate and interact with websites using screen readers and other assistive technologies.

    2. Web Design Analysis: The model can be employed to analyze the structure and layout of popular websites, helping web designers understand best practices and trends in web design. This information can inform the creation of new, user-friendly websites or redesigns of existing pages.

    3. Automatic Web Page Summary Generation: By identifying and extracting key elements, such as titles, headings, content blocks, and lists, the model can assist in generating concise summaries of web pages, which can aid users in their search for relevant information.

    4. Web Page Conversion and Optimization: The model can be used to detect redundant or unnecessary elements on a web page and suggest their removal or modification, leading to cleaner designs and faster-loading pages. This can improve user experience and, potentially, search engine rankings.

    5. Assisting Web Developers in Debugging and Testing: By detecting web page elements, the model can help identify inconsistencies or errors in a site's code or design, such as missing or misaligned elements, allowing developers to quickly diagnose and address these issues.

  6. Web_FileStructure_DataSet_100k

    • huggingface.co
    Updated Mar 25, 2025
    Cite
    Kerignard (2025). Web_FileStructure_DataSet_100k [Dataset]. https://huggingface.co/datasets/Juliankrg/Web_FileStructure_DataSet_100k
    Explore at:
    Dataset updated
    Mar 25, 2025
    Authors
    Kerignard
    Description

    Dataset Name:

    Web File Structure Dataset

    Description:

    This dataset is designed to train AI models on best practices for organizing files in web development projects. It includes 100,000 examples that cover the structure and conventions of HTML, CSS, JavaScript, and other web-related files. Each example consists of a prompt and a corresponding completion, providing comprehensive guidance on how to organize web project files effectively.

      Key Features:… See the full description on the dataset page: https://huggingface.co/datasets/Juliankrg/Web_FileStructure_DataSet_100k.
    
  7. Detect_web_element Dataset

    • universe.roboflow.com
    zip
    Updated Nov 24, 2022
    + more versions
    Cite
    yolo (2022). Detect_web_element Dataset [Dataset]. https://universe.roboflow.com/yolo-ikkms/detect_web_element
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 24, 2022
    Dataset authored and provided by
    yolo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Content Bounding Boxes
    Description

    Detect_web_element

    ## Overview
    
    Detect_web_element is a dataset for object detection tasks - it contains Content annotations for 1,206 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. Evolution of Web search engine interfaces through SERP screenshots and HTML...

    • rdm.inesctec.pt
    Updated Jul 26, 2021
    Cite
    (2021). Evolution of Web search engine interfaces through SERP screenshots and HTML complete pages for 20 years - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/cs-2021-003
    Explore at:
    Dataset updated
    Jul 26, 2021
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of “10 blue links” has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most searched queries by year to extract a representative sample of SERPs from the Internet Archive. The Internet Archive has been keeping snapshots and the respective HTML versions of webpages over time, and its collection contains more than 50 billion webpages. We used Python and Selenium WebDriver, for browser automation, to visit each capture online, check whether the capture is valid, save the HTML version, and generate a full screenshot.

    The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a files folder. For file naming, we concatenate the initial of the search engine (G) with the capture's timestamp. The filename ends with a sequential integer "-N" if the timestamp is repeated. For example, "G20070330145203-1" identifies a second capture from Google on March 30, 2007; the first is identified by "G20070330145203".

    Using this dataset, we analyzed how SERPs evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We registered the appearance of SERP features and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERPs has been rising over the years, demanding a more extensive interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here. The accompanying graphic represents the diversity of captures by year and search engine (Google and Bing).
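    The capture-naming convention described above can be parsed mechanically; a small sketch under the stated convention (the mapping of initials beyond "G" for Google is an assumption based on the description's final sentence):

    ```python
    import re
    from datetime import datetime

    # Engine initials assumed from the description (Google and Bing).
    ENGINES = {"G": "Google", "B": "Bing"}


    def parse_capture_id(capture_id: str) -> dict:
        """Split a capture name like 'G20070330145203-1' into engine, timestamp, sequence."""
        m = re.fullmatch(r"([A-Z])(\d{14})(?:-(\d+))?", capture_id)
        if not m:
            raise ValueError(f"unrecognised capture id: {capture_id}")
        initial, ts, seq = m.groups()
        return {
            "engine": ENGINES.get(initial, initial),
            "timestamp": datetime.strptime(ts, "%Y%m%d%H%M%S"),
            # No "-N" suffix means the first capture for that timestamp.
            "sequence": int(seq) if seq else 0,
        }
    ```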

  9. Coho Abundance - Linear Features [ds183]

    • data-cdfw.opendata.arcgis.com
    • data.ca.gov
    • +7more
    Updated Oct 1, 2014
    Cite
    California Department of Fish and Wildlife (2014). Coho Abundance - Linear Features [ds183] [Dataset]. https://data-cdfw.opendata.arcgis.com/datasets/CDFW::coho-abundance-linear-features-ds183
    Explore at:
    Dataset updated
    Oct 1, 2014
    Dataset authored and provided by
    California Department of Fish and Wildlife (https://wildlife.ca.gov/)
    Area covered
    Description

    The CalFish Abundance Database contains a comprehensive collection of anadromous fisheries abundance information. Beginning in 1998, the Pacific States Marine Fisheries Commission, the California Department of Fish and Game, and the National Marine Fisheries Service began a cooperative project aimed at collecting, archiving, and entering into standardized electronic formats the wealth of information generated by fisheries resource management agencies and tribes throughout California.

    Extensive data are currently available for chinook, coho, and steelhead. Major data categories include adult abundance population estimates, actual fish and/or carcass counts, counts of fish collected at dams, weirs, or traps, and redd counts. Harvest data has been compiled for many streams, and hatchery return data has been compiled for the state's mitigation facilities. A draft format has been developed for juvenile abundance and awaits final approval.

    This CalFish Abundance Database shapefile was generated from fully routed 1:100,000 hydrography. In a few cases streams had to be added to the hydrography dataset in order to create shapefiles representing the abundance data associated with them. Streams added were digitized at no more than 1:24,000 scale, based on stream line images portrayed in 1:24,000 Digital Raster Graphics (DRG).

    These features generally represent abundance counts resulting from stream surveys. The linear features in this layer typically represent the location to which abundance data records apply: the reach or length of stream surveyed, or the stream sections for which a given population estimate applies. In some cases the actual stream section surveyed was not specified, and linear features represent the entire stream. In many cases there are multiple datasets associated with the same length of stream, so linear features overlap. Please view the associated datasets for detail regarding specific features. In CalFish these are accessed through the "link" that is visible when performing an identify or query operation. A URL string is provided with each feature in the downloadable data, which can also be used to access the underlying datasets.

    The coho data available via the CalFish website is linked directly to the StreamNet website, where the database's tabular data is currently stored. Additional information about StreamNet may be downloaded at http://www.streamnet.org. Complete documentation for the StreamNet database may be accessed at http://www.streamnet.org/def.html

  10. Identifying Interesting Web Pages

    • kaggle.com
    Updated Sep 14, 2017
    Cite
    UCI Machine Learning (2017). Identifying Interesting Web Pages [Dataset]. https://www.kaggle.com/uciml/identifying-interesting-web-pages/discussion
    Explore at:
    Croissant
    Dataset updated
    Sep 14, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    UCI Machine Learning
    Description

    Context

    The problem is to predict user ratings for web pages (within a subject category). The HTML source of each web page is given. Users looked at each web page and rated 50-100 pages per domain on a 3-point scale (hot, medium, cold).

    Content

    This database contains the HTML source of web pages plus the ratings of a single user on these web pages. Web pages are on four separate subjects (Bands, i.e. recording artists; Goats; Sheep; and BioMedical).

    Acknowledgement

    Data originally from the UCI ML Repository. Donated by:

    Michael Pazzani Department of Information and Computer Science, University of California, Irvine Irvine, CA 92697-3425 pazzani@ics.uci.edu

    Concept based Information Access with Google for Personalized Information Retrieval

  11. The Klarna Product-Page Dataset

    • data.niaid.nih.gov
    • researchdata.se
    • +1more
    Updated Nov 7, 2024
    Cite
    Moradi, Aref (2024). The Klarna Product-Page Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12605479
    Explore at:
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    Moradi, Aref
    Magureanu, Stefan
    Risuleo, Riccardo Sven
    Hotti, Alexandra
    Lagergren, Jens
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Description

    The Klarna Product Page Dataset is a dataset of publicly available pages corresponding to products sold online on various e-commerce websites. The dataset contains offline snapshots of 51,701 product pages collected from 8,175 distinct merchants across 8 different markets (US, GB, SE, NL, FI, NO, DE, AT) between 2018 and 2019. On each page, analysts labelled 5 elements of interest: the price of the product, its image, its name and the add-to-cart and go-to-cart buttons (if found). These labels are present in the HTML code as an attribute called klarna-ai-label taking one of the values: Price, Name, Main picture, Add to cart and Cart.
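    Because the labels live in the HTML as a klarna-ai-label attribute, they can be recovered with any HTML parser. A stdlib sketch (the helper and its return shape are illustrative, not part of the dataset's own tooling):

    ```python
    from html.parser import HTMLParser

    # The five label values stated in the dataset description.
    LABELS = {"Price", "Name", "Main picture", "Add to cart", "Cart"}


    class KlarnaLabelFinder(HTMLParser):
        """Record which labelled elements appear, keyed by klarna-ai-label value."""

        def __init__(self):
            super().__init__()
            self.found = {}

        def handle_starttag(self, tag, attrs):
            label = dict(attrs).get("klarna-ai-label")
            if label in LABELS:
                self.found[label] = tag  # remember which tag carried the label


    def find_labels(html: str) -> dict:
        """Map each labelled element found in one snapshot to its tag name."""
        finder = KlarnaLabelFinder()
        finder.feed(html)
        return finder.found
    ```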

    The snapshots are available in 3 formats: as MHTML files (~24 GB), as WebTraversalLibrary (WTL) snapshots (~7.4 GB), and as screenshots (~8.9 GB). The MHTML format is the least lossy: a browser can render these pages, though any JavaScript on the page is lost. The WTL snapshots are produced by loading the MHTML pages into a Chromium-based browser. To keep the WTL dataset compact, the screenshots of the rendered MHTML are provided separately; here we provide the HTML of the rendered DOM tree and additional page and element metadata with rendering information (bounding boxes of elements, font sizes, etc.). The folder structure of the screenshot dataset is identical to that of the WTL dataset and can be used to complete the WTL snapshots with image information. For convenience, the datasets are provided with a train/test split in which no merchant in the test set is present in the training set.

    Corresponding Publication

    For more information about the contents of the datasets (statistics etc.) please refer to the following TMLR paper.

    GitHub Repository

    The code needed to re-run the experiments in the publication accompanying the dataset can be accessed here.

    Citing

    If you found this dataset useful in your research, please cite the paper as follows:

    @article{hotti2024the,
      title={The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models},
      author={Alexandra Hotti and Riccardo Sven Risuleo and Stefan Magureanu and Aref Moradi and Jens Lagergren},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2024},
      url={https://openreview.net/forum?id=zz6FesdDbB},
    }

  12. WebSRC Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Nov 21, 2024
    Cite
    Xingyu Chen; Zihan Zhao; Lu Chen; Danyang Zhang; Jiabao Ji; Ao Luo; Yuxuan Xiong; Kai Yu (2024). WebSRC Dataset [Dataset]. https://paperswithcode.com/dataset/websrc
    Explore at:
    Dataset updated
    Nov 21, 2024
    Authors
    Xingyu Chen; Zihan Zhao; Lu Chen; Danyang Zhang; Jiabao Ji; Ao Luo; Yuxuan Xiong; Kai Yu
    Description

    WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no.

  13. VA FOIA Website

    • datasets.ai
    • data.va.gov
    • +4more
    Updated Sep 9, 2024
    Cite
    Department of Veterans Affairs (2024). VA FOIA Website [Dataset]. https://datasets.ai/datasets/va-foia-website
    Explore at:
    Available download formats: 21
    Dataset updated
    Sep 9, 2024
    Dataset provided by
    United States Department of Veterans Affairs (http://va.gov/)
    Authors
    Department of Veterans Affairs
    Description

    U.S. Department of Veterans Affairs Freedom of Information Act Service Webpage with many links to associated information.

  14. State Cancer Profiles Web site

    • catalog.data.gov
    • healthdata.gov
    • +3more
    Updated Jul 26, 2023
    + more versions
    Cite
    Department of Health & Human Services (2023). State Cancer Profiles Web site [Dataset]. https://catalog.data.gov/dataset/state-cancer-profiles-web-site
    Explore at:
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    United States Department of Health and Human Services (http://www.hhs.gov/)
    Description

    The State Cancer Profiles (SCP) web site provides statistics to help guide and prioritize cancer control activities at the state and local levels. SCP is a collaborative effort using local and national level cancer data from the Centers for Disease Control and Prevention's National Program of Cancer Registries (NPCR) and the National Cancer Institute's Surveillance, Epidemiology and End Results Registries (SEER). SCP addresses select types of cancer and select behavioral risk factors for which there are evidence-based control interventions. The site provides incidence, mortality, and prevalence comparison tables as well as interactive graphs and maps and support data. The graphs and maps provide visual support for deciding where to focus cancer control efforts.

  15. The Items Dataset

    • zenodo.org
    Updated Nov 13, 2024
    Cite
    Patrick Egan; Patrick Egan (2024). The Items Dataset [Dataset]. http://doi.org/10.5281/zenodo.10964134
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Patrick Egan; Patrick Egan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset originally created 03/01/2019 UPDATE: Packaged on 04/18/2019 UPDATE: Edited README on 04/18/2019

    I. About this Data Set This data set is a snapshot of ongoing work, a collaboration between Kluge Fellow in Digital Studies Patrick Egan and an intern at the American Folklife Center at the Library of Congress. It contains a combination of metadata from various collections that contain audio recordings of Irish traditional music. The development of this dataset is iterative, and it integrates visualizations that follow the key principles of trust and approachability. The project, entitled “Connections In Sound”, invites you to use and re-use this data.

    The text available in the Items dataset is generated from multiple collections of audio material that were discovered at the American Folklife Center. Each instance of a performance was listed and “sets” or medleys of tunes or songs were split into distinct instances in order to allow machines to read each title separately (whilst still noting that they were part of a group of tunes). The work of the intern was then reviewed before publication, and cross-referenced with the tune index at www.irishtune.info. The Items dataset consists of just over 1000 rows, with new data being added daily in a separate file.

    The collections dataset contains at least 37 rows of collections that were located by a reference librarian at the American Folklife Center. This search was complemented by searches of the collections by the scholar both on the internet at https://catalog.loc.gov and by using card catalogs.

    Updates to these datasets will be announced and published as the project progresses.

    II. What’s included? This data set includes:

    • The Items Dataset – a .CSV containing Media Note, OriginalFormat, On Website, Collection Ref, Missing In Duplication, Collection, Outside Link, Performer, Solo/multiple, Sub-item, type of tune, Tune, Position, Location, State, Date, Notes/Composer, Potential Linked Data, Instrument, Additional Notes, Tune Cleanup. This .CSV is the direct export of the Items Google Spreadsheet

    III. How Was It Created? These data were created by a Kluge Fellow in Digital Studies and an intern on this program over the course of three months. By listening, transcribing, reviewing, and tagging audio recordings, these scholars improve access and connect sounds in the American Folklife Collections by focusing on Irish traditional music. Once transcribed and tagged, information in these datasets is reviewed before publication.

    IV. Data Set Field Descriptions


    a) Collections dataset field descriptions

    • ItemId – this is the identifier for the collection that was found at the AFC
    • Viewed – if the collection has been viewed, or accessed in any way by the researchers.
    • On LOC – whether or not there are audio recordings of this collection available on the Library of Congress website.
    • On Other Website – if any of the recordings in this collection are available elsewhere on the internet
    • Original Format – the format that was used during the creation of the recordings that were found within each collection
    • Search – this indicates the type of search that was performed to locate recordings and collections within the AFC
    • Collection – the official title for the collection as noted on the Library of Congress website
    • State – The primary state where recordings from the collection were located
    • Other States – The secondary states where recordings from the collection were located
    • Era / Date – The decade or year associated with each collection
    • Call Number – This is the official reference number that is used to locate the collections, both in the urls used on the Library website, and in the reference search for catalog cards (catalog cards can be searched at this address: https://memory.loc.gov/diglib/ihas/html/afccards/afccards-home.html)
    • Finding Aid Online? – Whether or not a finding aid is available for this collection on the internet

    b) Items dataset field descriptions

    • id – the specific identification of the instance of a tune, song or dance within the dataset
    • Media Note – Any information that is included with the original format, such as identification, name of physical item, additional metadata written on the physical item
    • Original Format – The physical format that was used when recording each specific performance. Note: this field is used in order to calculate the number of physical items that were created in each collection such as 32 wax cylinders.
    • On Website? – Whether or not each instance of a performance is available on the Library of Congress website
    • Collection Ref – The official reference number of the collection
    • Missing In Duplication – This column marks if parts of some recordings had been made available on other websites, but not all of the recordings were included in duplication (see recordings from the Philadelphia Céilí Group on the Villanova University website)
    • Collection – The official title of the collection given by the American Folklife Center
    • Outside Link – If recordings are available on other websites externally
    • Performer – The name of the contributor(s)
    • Solo/multiple – This field is used to calculate the amount of solo performers vs group performers in each collection
    • Sub-item – In some cases, physical recordings contained extra details, the sub-item column was used to denote these details
    • Type of item – This column describes each individual item type, as noted by performers and collectors
    • Item – The item title, as noted by performers and collectors. If an item was not described, it was entered as “unidentified”
    • Position – The position on the recording (in some cases during playback, audio cassette player counter markers were used)
    • Location – Local address of the recording
    • State – The state where the recording was made
    • Date – The date that the recording was made
    • Notes/Composer – The stated composer or source of the item recorded
    • Potential Linked Data – If items may be linked to other recordings or data, this column was used to provide examples of potential relationships between them
    • Instrument – The instrument(s) that was used during the performance
    • Additional Notes – Notes about the process of capturing, transcribing and tagging recordings (for researcher and intern collaboration purposes)
    • Tune Cleanup – This column was used to tidy each item so that it could be read by machines, but also so that spelling mistakes from the Item column could be corrected, and as an aid to preserving iterations of the editing process

    V. Rights statement The text in this data set was created by the researcher and intern and can be used in many different ways under creative commons with attribution. All contributions to Connections In Sound are released into the public domain as they are created. Anyone is free to use and re-use this data set in any way they want, provided reference is given to the creators of these datasets.

    VI. Creator and Contributor Information

    Creator: Connections In Sound

    Contributors: Library of Congress Labs

    VII. Contact Information Please direct all questions and comments to Patrick Egan via www.twitter.com/drpatrickegan or via his website at www.patrickegan.org. You can also get in touch with the Library of Congress Labs team via LC-Labs@loc.gov.

  16. Web Graphs

    • kaggle.com
    zip
    Updated Nov 11, 2021
    Cite
    Subhajit Sahu (2021). Web Graphs [Dataset]. https://www.kaggle.com/wolfram77/graphs-web
    Explore at:
    Available download formats: zip (52,848,952 bytes)
    Dataset updated
    Nov 11, 2021
    Authors
    Subhajit Sahu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dynamic face-to-face interaction networks represent the interactions that happen during discussions between a group of participants playing the Resistance game. This dataset contains networks extracted from 62 games. Each game is played by 5-8 participants and lasts 45-60 minutes. We extract dynamically evolving networks from the free-form discussions using the ICAF algorithm. The extracted networks are used to characterize and detect group deceptive behavior using the DeceptionRank algorithm.

    The networks are weighted, directed and temporal. Each node represents a participant. At each 1/3 second, a directed edge from node u to v is weighted by the probability of participant u looking at participant v or the laptop. Additionally, we also provide a binary version where an edge from u to v indicates participant u looks at participant v (or the laptop).
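    One way to hold such a network in memory is a frame-indexed map of weighted directed edges; a minimal sketch (the class, its API, and the 0.5 threshold for deriving a binary view are my own illustration; the dataset ships its own binary variant):

    ```python
    from collections import defaultdict


    class TemporalGazeNetwork:
        """Directed, weighted, temporal network: one snapshot per 1/3-second frame."""

        def __init__(self):
            # frames[t][(u, v)] = probability that participant u looks at v in frame t
            self.frames = defaultdict(dict)

        def add_gaze(self, t, u, v, prob):
            """Record a weighted directed edge u -> v for frame t."""
            self.frames[t][(u, v)] = prob

        def binary_snapshot(self, t, threshold=0.5):
            """Binary view of frame t: keep edges whose gaze probability clears the threshold."""
            return {edge for edge, p in self.frames[t].items() if p >= threshold}
    ```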

    Stanford Network Analysis Platform (SNAP) is a general-purpose, high-performance system for analysis and manipulation of large networks. Graphs consist of nodes and directed/undirected/multiple edges between the graph nodes. Networks are graphs with data on nodes and/or edges of the network.

    The core SNAP library is written in C++ and optimized for maximum performance and compact graph representation. It easily scales to massive networks with hundreds of millions of nodes, and billions of edges. It efficiently manipulates large graphs, calculates structural properties, generates regular and random graphs, and supports attributes on nodes and edges. Besides scalability to large graphs, an additional strength of SNAP is that nodes, edges and attributes in a graph or a network can be changed dynamically during the computation.

    SNAP was originally developed by Jure Leskovec in the course of his PhD studies. The first release was made available in November 2009. SNAP uses GLib, a general-purpose STL (Standard Template Library)-like library developed at the Jozef Stefan Institute. SNAP and GLib are being actively developed and used in numerous academic and industrial projects.

    http://snap.stanford.edu/data/index.html#face2face

  17. Phishing websites

    • kaggle.com
    Updated Jun 21, 2023
    Cite
    Satya Ganesh Kumar (2023). Phishing websites [Dataset]. https://www.kaggle.com/datasets/satyaganeshkumar/phishing-websites
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Satya Ganesh Kumar
    Description

    The "Phishing Data" dataset is a comprehensive collection of information curated for analyzing and understanding phishing attacks: malicious attempts to deceive individuals or organizations into disclosing sensitive information such as passwords or credit card details. The dataset comprises 18 distinct features that offer insight into the characteristics of phishing attempts:

    • the URL of the website being analyzed
    • the length of the URL
    • the use of URL-shortening services
    • the presence of the "@" symbol
    • the presence of redirection using "//"
    • the presence of prefixes or suffixes in the URL
    • the number of subdomains
    • the use of a secure connection protocol (HTTPS)
    • the length of time since domain registration
    • the presence of a favicon
    • the presence of HTTP or HTTPS tokens in the domain name
    • the URLs of requested external resources
    • the presence of anchors in the URL
    • the number of hyperlinks in HTML tags
    • the server form handler used
    • the submission of data to email addresses
    • abnormal URL patterns
    • estimated website traffic or popularity

    Together, these features enable the analysis and detection of phishing attempts, aiding the development of models and algorithms to combat phishing attacks.
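
    Several of the URL-based features described above can be computed directly from the URL string. The sketch below is a small illustrative extractor; the feature names and decision rules are plausible assumptions, not the dataset's exact definitions:

```python
from urllib.parse import urlparse

def url_features(url):
    """Extract a few illustrative URL-based phishing features."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "has_at_symbol": "@" in url,
        # "//" appearing after the scheme separator suggests redirection
        "has_redirect": url.rfind("//") > len(parsed.scheme) + 2,
        # a "-" in the hostname often signals an added prefix/suffix
        "has_prefix_suffix": "-" in host,
        # rough subdomain count: dots beyond the domain/TLD separator
        "n_subdomains": max(host.count(".") - 1, 0),
        "uses_https": parsed.scheme == "https",
    }

feats = url_features("http://secure-login.example.com//redirect@evil.com")
```

    Feature dictionaries of this shape can then be vectorized and fed to a classifier.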

  18. Product Page Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Nov 2, 2021
    Cite
    Alexandra Hotti; Riccardo Sven Risuleo; Stefan Magureanu; Aref Moradi; Jens Lagergren (2021). Product Page Dataset [Dataset]. https://paperswithcode.com/dataset/product-page
    Explore at:
    Dataset updated
    Nov 2, 2021
    Authors
    Alexandra Hotti; Riccardo Sven Risuleo; Stefan Magureanu; Aref Moradi; Jens Lagergren
    Description

    Product Page is a large-scale and realistic dataset of webpages. It contains 51,701 manually labeled product pages from 8,175 real e-commerce websites. The pages can be rendered entirely in a web browser and are suitable for computer-vision applications. This makes the dataset substantially richer and more diverse than other datasets proposed for element representation learning, classification, and prediction on the web.

  19. Web Data Commons (October 2021) Property and Datatype Usage Dataset

    • explore.openaire.eu
    Updated Mar 15, 2022
    Cite
    Jan Martin Keil (2022). Web Data Commons (October 2021) Property and Datatype Usage Dataset [Dataset]. http://doi.org/10.5281/zenodo.6337660
    Explore at:
    Dataset updated
    Mar 15, 2022
    Authors
    Jan Martin Keil
    Description

    This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (October 2021), based on the Common Crawl October 2021 archive. The dataset was produced using the RDF Property and Datatype Usage Scanner v2.1.1, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

    Dataset properties

    • Size: 0.2 GiB compressed, 4.4 GiB uncompressed; 20 361 829 rows plus 1 head line, determined using gunzip -c measurements.csv.gz | wc -l
    • Parsing failures: The scanner failed to parse 45 833 332 triples (~0.1 %) of the source dataset (containing 38 812 275 607 triples).

    Content

    • CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.
    • FILE_URL: The URL of the Web Data Commons file that has been measured.
    • MEASUREMENT: The applied measurement with specific conditions, one of:
        • UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
        • UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
        • UsedAsDatatype: The total number of literals with the datatype.
        • UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
        • ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
        • ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
        • ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
        • ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
        • ValidInfOrNaNNotation: The number of lexicals that equal either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float and xsd:double.
        • ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        • ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
        • ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
        • ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, xsd:integer, xsd:decimal, xsd:float, and xsd:double.
    • PROPERTY: The property that has been measured.
    • DATATYPE: The datatype that has been measured.
    • QUANTITY: The count of statements that fulfill the condition specified by the measurement, per file, property, and datatype.

    Note: The lexical representation of xsd:double values in embedded JSON-LD got normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.

    Preview

    "CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#longitude","https://www.w3.org/2001/XMLSchema#float","1"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#latitude","https://www.w3.org/2001/XMLSchema#float","1"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-rdfa.nq-00000.gz","UnpreciseRepresentableInDouble","https://purl.org/goodrelations/v1#hasCurrencyValue","https://www.w3.org/2001/XMLSchema#float","6"
    …
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-embedded-jsonld.nq-06239.gz","ValidZeroOrOneNotation","http://schema.org/ratingValue","http://www.w3.org/2001/XMLSchema#integer","96"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-embedded-jsonld.nq-06239.gz","ValidZeroOrOneNotation","http://schema.org/minValue","http://www.w3.org/2001/XMLSchema#integer","164"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2021-12/quads/dpef.html-embedded-jsonld.nq-06239.gz","ValidZeroOrOneNotation","http://schema.org/width","http://www.w3.org/2001/XMLSchema#integer","361"

    Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of probably "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions ...
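
    A CSV file with the six-column layout shown in the preview can be read with Python's csv module. The snippet below is an illustrative sketch; the sample rows use a shortened hypothetical FILE_URL, and the aggregation by MEASUREMENT is just one plausible use of the data:

```python
import csv
import io
from collections import Counter

# Two sample rows in the six-column format shown in the preview.
sample = '''"CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
"html-rdfa","http://example.org/file-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#longitude","https://www.w3.org/2001/XMLSchema#float","1"
"html-rdfa","http://example.org/file-00000.gz","UnpreciseRepresentableInDouble","https://www.w3.org/2006/vcard/ns#latitude","https://www.w3.org/2001/XMLSchema#float","1"
'''

def totals_by_measurement(fp):
    """Sum QUANTITY per MEASUREMENT across all rows."""
    totals = Counter()
    for row in csv.DictReader(fp):
        totals[row["MEASUREMENT"]] += int(row["QUANTITY"])
    return totals

totals = totals_by_measurement(io.StringIO(sample))
```

    For the real 4.4 GiB file, the same reader would be applied to a (decompressed) file handle instead of an in-memory string.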

  20. Web Data Commons Phones Dataset, Augmented Version, Fixed Splits

    • linkagelibrary.icpsr.umich.edu
    delimited
    Updated Nov 23, 2020
    Cite
    Anna Primpeli; Christian Bizer (2020). Web Data Commons Phones Dataset, Augmented Version, Fixed Splits [Dataset]. http://doi.org/10.3886/E127243V1
    Explore at:
    Available download formats: delimited
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation: Entity matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks have been made available for evaluating entity matching methods. However, the lack of fixed development and test splits, as well as of correspondence sets that include both matching and non-matching record pairs, hinders the reproducibility and comparability of benchmark experiments. To enhance reproducibility and comparability, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits.

    Dataset description: This is an augmented version of the WDC Phones dataset for benchmarking entity matching/record linkage methods, found at http://webdatacommons.org/productcorpus/index.html#toc4. The augmented version adds fixed splits for training, validation, and testing, as well as their corresponding feature vectors. The feature vectors are built using datatype-specific similarity metrics. The dataset contains 447 records describing products from 17 e-shops, which are matched against a product catalog of 50 products. The gold standard has manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to describe the product records is 26, and the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository, which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html
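
    Building a feature vector from datatype-specific similarity metrics, as described above, can be sketched as follows. The attribute names, the specific metrics, and the sample records are illustrative assumptions; the benchmark's actual metrics may differ:

```python
def jaccard(a, b):
    """Token-level Jaccard similarity for string attributes."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def rel_diff_sim(a, b):
    """Similarity for numeric attributes: 1 minus the relative difference."""
    denom = max(abs(a), abs(b))
    return 1.0 - abs(a - b) / denom if denom else 1.0

def feature_vector(rec1, rec2):
    """One similarity feature per attribute, chosen by its datatype."""
    return [
        jaccard(rec1["title"], rec2["title"]),      # string attribute
        rel_diff_sim(rec1["price"], rec2["price"]), # numeric attribute
    ]

offer = {"title": "acme phone x200 black", "price": 199.0}
catalog = {"title": "acme x200 phone", "price": 200.0}
vec = feature_vector(offer, catalog)
```

    A matching classifier is then trained on such vectors for the fixed sets of matching and non-matching pairs.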
