100+ datasets found

OER sample data-set
data.uni-hannover.de
csv
Updated Jan 20, 2022
Cite
L3S (2022). OER sample data-set [Dataset]. https://data.uni-hannover.de/dataset/oer-sample-data-set
Explore at:
Available download formats: csv
Dataset updated
Jan 20, 2022
Dataset authored and provided by
L3S
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
This data-set includes information about a sample of 8,887 Open Educational Resources (OERs) from the SkillsCommons website. It contains title, description, URL, type, availability date, issued date, subjects, and the availability of the following metadata: level, time_required to finish, and accessibility.

This data-set has been used to build a metadata scoring and quality prediction model for OERs.
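
As a rough illustration of what metadata scoring over these fields could look like, here is a minimal sketch of a completeness score. The field names and the equal weighting are assumptions made for this example; the listing does not describe the authors' actual scoring or prediction model.

```python
# Hypothetical field names, loosely following the columns listed above.
FIELDS = ["title", "description", "url", "type", "subjects",
          "level", "time_required", "accessibility"]


def completeness_score(record: dict) -> float:
    """Return the fraction of FIELDS that are present and non-empty."""
    filled = sum(1 for f in FIELDS if str(record.get(f) or "").strip())
    return filled / len(FIELDS)


# Made-up record for illustration only.
example = {
    "title": "Intro to Welding",
    "description": "A short course.",
    "url": "https://example.org/oer/1",
    "type": "Course",
    "subjects": "Manufacturing",
    "level": "",            # metadata not available
    "time_required": None,  # metadata not available
    "accessibility": "yes",
}
print(completeness_score(example))  # 6 of 8 fields filled -> 0.75
```

A score like this only measures presence, not quality; a real model would likely weight fields differently.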
Dataset for Privacy Exercises
kaggle.com
zip
Updated Apr 9, 2024
Cite
Shining (2024). Dataset for Privacy Exercises [Dataset]. https://www.kaggle.com/datasets/shiningana/dataset-for-privacy-exercises
Explore at:
Available download formats: zip (7327312 bytes)
Dataset updated
Apr 9, 2024
Authors
Shining
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset gives some data of a hypothetical business that can be used to practice your privacy data transformation and analysis skills.

The dataset contains the following files/tables:
1. customer_orders_for_privacy_exercises.csv contains data of a business about customer orders (columns separated by commas)
2. users_web_browsing_for_privacy_exercises.csv contains data collected by the business website about its users (columns separated by commas)
3. iot_example.csv contains data collected by a smart device on users' biometric data (columns separated by commas)
4. members.csv contains data collected by a library on its users (columns separated by commas)
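
Two common privacy transformations that tables like these lend themselves to are pseudonymization (replacing an identifier with a keyed hash) and generalization (coarsening an exact value into a band). The sketch below shows both using only the Python standard library. The column names `email`, `age`, and `order_total` are hypothetical and are not taken from the actual CSV files.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # assumption: stored outside the dataset


def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), truncated to 16 hex chars."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def generalize_age(age: int, width: int = 10) -> str:
    """Map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Made-up row; the real files may use different columns.
row = {"email": "jane@example.com", "age": 34, "order_total": 59.90}
safe_row = {
    "user_id": pseudonymize(row["email"]),
    "age_band": generalize_age(row["age"]),
    "order_total": row["order_total"],
}
print(safe_row)
```

The keyed hash keeps the same person linkable across rows without exposing the e-mail address, provided the key itself stays secret.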
Data Management Plan Examples Database
borealisdata.ca
search.dataone.org
Updated Aug 27, 2024
Cite
Rebeca Gaston Jothyraj; Shrey Acharya; Isaac Pratt; Danica Evering; Sarthak Behal (2024). Data Management Plan Examples Database [Dataset]. http://doi.org/10.5683/SP3/SDITUG
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.5683/SP3/SDITUG
Dataset updated
Aug 27, 2024
Dataset provided by
Borealis
Authors
Rebeca Gaston Jothyraj; Shrey Acharya; Isaac Pratt; Danica Evering; Sarthak Behal
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Time period covered
2011 - 2024
Description
This dataset comprises a collection of example DMPs from a wide array of fields, obtained from a number of different sources outlined in the README. Data extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date modified, title, research and data type, description of project, link to the DMP, and, where possible, external links to related publications, grant pages, or French-language versions. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
SAMPLE DATASET
staging-elsevier.digitalcommonsdata.com
Updated Jul 10, 2019
Cite
FirstName+36125284 LastName+36125284 (2019). SAMPLE DATASET [Dataset]. http://doi.org/10.1234/tgpfnk7zyt.19
Unique identifier
https://doi.org/10.1234/tgpfnk7zyt.19
Dataset updated
Jul 10, 2019
Authors
FirstName+36125284 LastName+36125284
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the description of a dataset. The description can be quite long and this can look strange in the public dataset page. In the drafts page there is a scrollbar in the scrollbar, why not in the public page? Well, the public page needs to support viewing on a mobile phone and this can make scroll bars within scrollbars within scrollbars a little difficult. So maybe it’ll be better to try using ellipses. Additionally only adding a description does not make it a new version.
Website Screenshots Dataset
universe.roboflow.com
kaggle.com
zip
Updated Aug 19, 2022
Cite
Roboflow (2022). Website Screenshots Dataset [Dataset]. https://universe.roboflow.com/roboflow-gw7yv/website-screenshots/model/1
Explore at:
Available download formats: zip
Dataset updated
Aug 19, 2022
Dataset authored and provided by
Roboflow https://roboflow.com/
License
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Variables measured
Elements Bounding Boxes
Description
About This Dataset

The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1000 of the world's top websites. They have been automatically annotated to label the following classes:

* button - navigation links, tabs, etc.
* heading - text that was enclosed in <h1> to <h6> tags.
* link - inline, textual <a> tags.
* label - text labeling form fields.
* text - all other text.
* image - <img>, <svg>, or <video> tags, and icons.
* iframe - ads and 3rd party content.

Example

This is an example image and annotation from the dataset (Wikipedia screenshot): https://i.imgur.com/mOG3u3Z.png

Usage

Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.

Collecting Custom Data

Roboflow is happy to provide a custom screenshots dataset to meet your particular needs. We can crawl public or internal web applications. Just reach out and we'll be happy to provide a quote!

About Roboflow

Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless. Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.
amazon-product-data-sample
huggingface.co
Cite
Iftach Arbel, amazon-product-data-sample [Dataset]. https://huggingface.co/datasets/iarbel/amazon-product-data-sample
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Authors
Iftach Arbel
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Dataset Card for "amazon-product-data-filter"

Dataset Summary

The Amazon Product Dataset contains product listing data from the Amazon US website. It can be used for various NLP and classification tasks, such as text generation, product type classification, attribute extraction, image recognition and more. NOTICE: This is a sample of the full Amazon Product Dataset, which contains 1K examples. Follow the link to gain access to the full dataset.

Languages… See the full description on the dataset page: https://huggingface.co/datasets/iarbel/amazon-product-data-sample.
Requirements data sets (user stories)
zenodo.org
data.mendeley.com
txt
Updated Jan 13, 2025
Cite
Fabiano Dalpiaz; Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.17632/7zbk8zsd8y.1
Dataset updated
Jan 13, 2025
Dataset provided by
Mendeley Data
Authors
Fabiano Dalpiaz; Fabiano Dalpiaz
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A collection of 22 data sets of 50+ requirements each, expressed as user stories.

The dataset has been created by gathering data from web sources and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies]

The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

Overview of the datasets [data and links added in December 2024]

The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

Public administration and transparency

g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website that is used to publicly share the spending data for the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal-spending-related websites, including many more projects than the one described in the shared collection, can be found here.

g03-loudoun.txt (2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases about a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

g04-recycling.txt (2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and is the basis of a students' project on web site design; the code is available (no license).

g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.

g11-nsf.txt (2018) refers to a collection of user stories referring to the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

(Research) data and meta-data management

g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.

g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.

g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.

g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.

g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source, web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its
Behance Community Art Data
cseweb.ucsd.edu
json
Cite
UCSD CSE Research Project, Behance Community Art Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
Available download formats: json
Dataset authored and provided by
UCSD CSE Research Project
Description
Likes and image data from the community art website Behance. This is a small, anonymized version of a larger proprietary dataset.

Metadata includes

appreciates (likes)

timestamps

extracted image features

Basic Statistics:

Users: 63,497

Items: 178,788

Appreciates (likes): 1,000,000
VA Personal Health Record Sample Data
catalog.data.gov
datahub.va.gov
Updated Aug 2, 2025
Cite
Department of Veterans Affairs (2025). VA Personal Health Record Sample Data [Dataset]. https://catalog.data.gov/dataset/va-personal-health-record-sample-data
Dataset updated
Aug 2, 2025
Dataset provided by
United States Department of Veterans Affairs http://va.gov/
Description
My HealtheVet (www.myhealth.va.gov) is a Personal Health Record portal designed to improve the delivery of health care services to Veterans, to promote health and wellness, and to engage Veterans as more active participants in their health care. The My HealtheVet portal enables Veterans to create and maintain a web-based PHR that provides access to patient health education information and resources, a comprehensive personal health journal, and electronic services such as online VA prescription refill requests and Secure Messaging. Veterans can visit the My HealtheVet website and self-register to create an account, although registration is not required to view the professionally-sponsored health education resources, including topics of special interest to the Veteran population. Once registered, Veterans can create a customized PHR that is accessible from any computer with Internet access.
E-Commerce Data
kaggle.com
zip
Updated Aug 17, 2017
Cite
Carrie (2017). E-Commerce Data [Dataset]. https://www.kaggle.com/datasets/carrie1/ecommerce-data
Explore at:
Available download formats: zip (7548686 bytes)
Dataset updated
Aug 17, 2017
Authors
Carrie
Description
Context

Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found by the title "Online Retail".

Content

"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

Acknowledgements

Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

Image from stocksnap.io.

Inspiration

Analyses for this dataset could include time series, clustering, classification and more.
AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and...
datarade.ai
Updated Dec 18, 2024
Cite
MealMe (2024). AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and Marketplace Websites [Dataset]. https://datarade.ai/data-products/ai-training-data-annotated-checkout-flows-for-retail-resta-mealme
Dataset updated
Dec 18, 2024
Dataset authored and provided by
MealMe
Area covered
United States of America
Description
AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and Marketplace Websites

Overview

Unlock the next generation of agentic commerce and automated shopping experiences with this comprehensive dataset of meticulously annotated checkout flows, sourced directly from leading retail, restaurant, and marketplace websites. Designed for developers, researchers, and AI labs building large language models (LLMs) and agentic systems capable of online purchasing, this dataset captures the real-world complexity of digital transactions—from cart initiation to final payment.

Key Features

Breadth of Coverage: Over 10,000 unique checkout journeys across hundreds of top e-commerce, food delivery, and service platforms, including but not limited to Walmart, Target, Kroger, Whole Foods, Uber Eats, Instacart, Shopify-powered sites, and more.

Actionable Annotation: Every flow is broken down into granular, step-by-step actions, complete with timestamped events, UI context, form field details, validation logic, and response feedback. Each step includes:

Page state (URL, DOM snapshot, and metadata)

User actions (clicks, taps, text input, dropdown selection, checkbox/radio interactions)

System responses (AJAX calls, error/success messages, cart/price updates)

Authentication and account linking steps where applicable

Payment entry (card, wallet, alternative methods)

Order review and confirmation

Multi-Vertical, Real-World Data: Flows sourced from a wide variety of verticals and real consumer environments, not just demo stores or test accounts. Includes complex cases such as multi-item carts, promo codes, loyalty integration, and split payments.

Structured for Machine Learning: Delivered in standard formats (JSONL, CSV, or your preferred schema), with every event mapped to action types, page features, and expected outcomes. Optional HAR files and raw network request logs provide an extra layer of technical fidelity for action modeling and RLHF pipelines.

Rich Context for LLMs and Agents: Every annotation includes both human-readable and model-consumable descriptions:

“What the user did” (natural language)

“What the system did in response”

“What a successful action should look like”

Error/edge case coverage (invalid forms, OOS, address/payment errors)

Privacy-Safe & Compliant: All flows are depersonalized and scrubbed of PII. Sensitive fields (like credit card numbers, user addresses, and login credentials) are replaced with realistic but synthetic data, ensuring compliance with privacy regulations.

Each flow tracks the user journey from cart to payment to confirmation, including:

Adding/removing items

Applying coupons or promo codes

Selecting shipping/delivery options

Account creation, login, or guest checkout

Inputting payment details (card, wallet, Buy Now Pay Later)

Handling validation errors or OOS scenarios

Order review and final placement

Confirmation page capture (including order summary details)

Why This Dataset?

Building LLMs, agentic shopping bots, or e-commerce automation tools demands more than just page screenshots or API logs. You need deeply contextualized, action-oriented data that reflects how real users interact with the complex, ever-changing UIs of digital commerce. Our dataset uniquely captures:

The full intent-action-outcome loop

Dynamic UI changes, modals, validation, and error handling

Nuances of cart modification, bundle pricing, delivery constraints, and multi-vendor checkouts

Mobile vs. desktop variations

Diverse merchant tech stacks (custom, Shopify, Magento, BigCommerce, native apps, etc.)

Use Cases

LLM Fine-Tuning: Teach models to reason through step-by-step transaction flows, infer next-best-actions, and generate robust, context-sensitive prompts for real-world ordering.

Agentic Shopping Bots: Train agents to navigate web/mobile checkouts autonomously, handle edge cases, and complete real purchases on behalf of users.

Action Model & RLHF Training: Provide reinforcement learning pipelines with ground truth “what happens if I do X?” data across hundreds of real merchants.

UI/UX Research & Synthetic User Studies: Identify friction points, bottlenecks, and drop-offs in modern checkout design by replaying flows and testing interventions.

Automated QA & Regression Testing: Use realistic flows as test cases for new features or third-party integrations.

What’s Included

10,000+ annotated checkout flows (retail, restaurant, marketplace)

Step-by-step event logs with metadata, DOM, and network context

Natural language explanations for each step and transition

All flows are depersonalized and privacy-compliant

Example scripts for ingesting, parsing, and analyzing the dataset

Flexible licensing for research or commercial use

Sample Categories Covered

Grocery delivery (Instacart, Walmart, Kroger, Target, etc.)

Restaurant takeout/delivery (Ub...
NYC STEW-MAP Staten Island organizations' website hyperlink webscrape
catalog.data.gov
s.cnmilf.com
Updated Nov 21, 2022
Cite
U.S. EPA Office of Research and Development (ORD) (2022). NYC STEW-MAP Staten Island organizations' website hyperlink webscrape [Dataset]. https://catalog.data.gov/dataset/nyc-stew-map-staten-island-organizations-website-hyperlink-webscrape
Dataset updated
Nov 21, 2022
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Area covered
Staten Island, New York
Description
The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis.

For dataset 1: Organizations were selected from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites and that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020).

For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "README" file for further details.

References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/.

This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
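
The edge-array layout described for dataset 2 (node1, node2, edge attribute) can be loaded into an adjacency structure with a few lines of standard-library Python. This is a minimal sketch: the sample rows, the `weight` column name, and its reading as a hyperlink count are assumptions for illustration; the README defines the real attribute.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in (node1, node2, edge attribute) form.
SAMPLE = """node1,node2,weight
orgA.org,orgB.org,3
orgA.org,orgC.org,1
orgB.org,orgC.org,2
"""


def load_edges(text: str) -> dict:
    """Build a directed adjacency map: source -> {target: weight}."""
    graph = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        graph[row["node1"]][row["node2"]] = int(row["weight"])
    return dict(graph)


def in_degree(graph: dict) -> dict:
    """Count how many distinct source nodes link to each target node."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


graph = load_edges(SAMPLE)
print(in_degree(graph))  # {'orgB.org': 1, 'orgC.org': 2}
```

In-degree is only one of several measures a hyperlink network analysis might use; it is shown here because it falls directly out of the edge list.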
An Example Showcase - Datasets - Government of Jersey Open Data
opendata.gov.je
Updated Sep 17, 2015
Cite
(2015). An Example Showcase - Datasets - Government of Jersey Open Data [Dataset]. https://opendata.gov.je/dataset/an-example-showcase
Dataset updated
Sep 17, 2015
Description
This is an example Showcase demonstrating the use of Showcase tags rather than using the more restrictive 'type' dropdown. Showcase and link to datasets in use. Datasets used in an app, website or visualization, or featured in an article, report or blog post can be showcased within the CKAN website. Showcases can include an image, description, tags and external link. Showcases may contain several datasets, helping users discover related datasets being used together. Showcases can be discovered by searching and filtered by tag. Site sysadmins can promote selected users to become 'Showcase Admins' to help create, populate and maintain showcases. ckanext-showcase is intended to be a more powerful replacement for the 'Related Item' feature.
Amazon Web Services - Public Data Sets
data.wu.ac.at
Updated Oct 10, 2013
Cite
Global (2013). Amazon Web Services - Public Data Sets [Dataset]. https://data.wu.ac.at/schema/datahub_io/NTYxNjkxNmYtNmZlNS00N2EwLWJkYTktZjFjZWJkNTM2MTNm
Dataset updated
Oct 10, 2013
Dataset provided by
Global
Description
About

From website:

Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.

Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.
Data set of the article: Using Machine Learning for Web Page Classification...
data.niaid.nih.gov
Updated Jan 6, 2021
Cite
Matošević, Goran; Dobša, Jasminka; Mladenić, Dunja (2021). Data set of the article: Using Machine Learning for Web Page Classification in Search Engine Optimization [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4416122
Dataset updated
Jan 6, 2021
Dataset provided by
Faculty of Organization and Informatics Varaždin, University of Zagreb, 10000 Zagreb, Croatia
Faculty of Economics and Tourism, University of Pula, 52100 Pula, Croatia
Jožef Stefan Institute, 1000 Ljubljana, Slovenia
Authors
Matošević, Goran; Dobša, Jasminka; Mladenić, Dunja
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data of investigation published in the article: "Using Machine Learning for Web Page Classification in Search Engine Optimization"

Abstract of the article:

This paper presents a novel approach of using machine learning algorithms based on experts’ knowledge to classify web pages into three predefined classes according to the degree of content adjustment to the search engine optimization (SEO) recommendations. In this study, classifiers were built and trained to classify an unknown sample (web page) into one of the three predefined classes and to identify important factors that affect the degree of page adjustment. The data in the training set are manually labeled by domain experts. The experimental results show that machine learning can be used for predicting the degree of adjustment of web pages to the SEO recommendations—classifier accuracy ranges from 54.59% to 69.67%, which is higher than the baseline accuracy of classification of samples in the majority class (48.83%). Practical significance of the proposed approach is in providing the core for building software agents and expert systems to automatically detect web pages, or parts of web pages, that need improvement to comply with the SEO guidelines and, therefore, potentially gain higher rankings by search engines. Also, the results of this study contribute to the field of detecting optimal values of ranking factors that search engines use to rank web pages. Experiments in this paper suggest that important factors to be taken into consideration when preparing a web page are page title, meta description, H1 tag (heading), and body text—which is aligned with the findings of previous research. Another result of this research is a new data set of manually labeled web pages that can be used in further research.
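
The baseline the abstract compares against is the accuracy obtained by always predicting the most frequent class. A minimal sketch of that calculation follows; the label names and counts are made up for illustration and are not taken from the data set.

```python
from collections import Counter


def majority_baseline(labels: list) -> float:
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical three-class labelling (low / medium / high SEO adjustment).
labels = ["low"] * 20 + ["medium"] * 50 + ["high"] * 30
print(majority_baseline(labels))  # 0.5
```

A classifier is only informative to the extent that it beats this number, which is why the abstract reports 48.83% alongside the classifier accuracies.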
Web page phishing detection
data.mendeley.com
narcis.nl
Updated Sep 26, 2020
Cite
Abdelhakim Hannousse (2020). Web page phishing detection [Dataset]. http://doi.org/10.17632/c2gw7fy2j4.1
Explore at:
Unique identifier
https://doi.org/10.17632/c2gw7fy2j4.1
Dataset updated
Sep 26, 2020
Authors
Abdelhakim Hannousse
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The provided dataset includes 11430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine-learning-based phishing detection systems. Features are from three different classes: 56 extracted from the structure and syntax of URLs, 24 extracted from the content of their corresponding pages, and 7 extracted by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Along with the dataset, we provide the Python scripts used to extract the features, for potential replication or extension.

dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL-based and content-based features, overcoming the short lifetime of phishing web pages.

dataset_B: contains the extracted feature values, which can be used directly as input to classifiers for examination. Note that the data in this dataset are indexed by URL, so the index needs to be removed before experimentation.
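As an illustration of removing the URL index before experimentation, here is a minimal Python sketch using only the standard library. The column names ("url", "length_url", "nb_dots", "status") and the two sample rows are assumptions made up for this example; they are not taken from the dataset description and should be checked against the actual file.

```python
import csv
import io

# Tiny in-memory stand-in for dataset_B. The column names and values
# here are hypothetical and only illustrate the layout: a URL index
# column, numeric feature columns, and a class label column.
SAMPLE = """url,length_url,nb_dots,status
http://example.com/login,24,1,legitimate
http://example.net/verify/account,33,1,phishing
"""


def split_features(rows, index_col="url", label_col="status"):
    """Drop the URL index and separate numeric features from labels."""
    features, labels = [], []
    for row in rows:
        labels.append(row[label_col])
        features.append(
            {
                name: float(value)
                for name, value in row.items()
                if name not in (index_col, label_col)
            }
        )
    return features, labels


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
X, y = split_features(rows)
```

After this step, X holds only numeric feature values and y holds the class labels, so the URL strings cannot leak into a classifier as a feature.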

The datasets were constructed in May 2020. Due to the huge size of dataset_A, only a sample of it is provided; I will try to divide it into sample files and upload them one by one. For a full copy, please contact the author directly at any time at: hannousse.abdelhakim@univ-guelma.dz
Job Postings Dataset for Labour Market Research and Insights
datarade.ai
Updated Sep 20, 2023
Cite
Oxylabs (2023). Job Postings Dataset for Labour Market Research and Insights [Dataset]. https://datarade.ai/data-products/job-postings-dataset-for-labour-market-research-and-insights-oxylabs
Explore at:
Available download formats: .json, .xml, .csv, .xls
Dataset updated
Sep 20, 2023
Dataset authored and provided by
Oxylabs
Area covered
Tajikistan, Sierra Leone, Zambia, British Indian Ocean Territory, Luxembourg, Anguilla, Kyrgyzstan, Jamaica, Togo, Switzerland
Description
Introducing Job Posting Datasets: Uncover labor market insights!

Elevate your recruitment strategies, forecast future labor industry trends, and unearth investment opportunities with Job Posting Datasets.

Job Posting Datasets Source:

Indeed: Access datasets from Indeed, a leading employment website known for its comprehensive job listings.

Glassdoor: Receive ready-to-use employee reviews, salary ranges, and job openings from Glassdoor.

StackShare: Access StackShare datasets to make data-driven technology decisions.

Job Posting Datasets provide meticulously acquired and parsed data, freeing you to focus on analysis. You'll receive clean, structured, ready-to-use job posting data, including job titles, company names, seniority levels, industries, locations, salaries, and employment types.

Choose your preferred dataset delivery options for convenience:

Receive datasets in various formats, including CSV, JSON, and more. Opt for storage solutions such as AWS S3, Google Cloud Storage, and more. Customize data delivery frequencies, whether one-time or per your agreed schedule.

Why Choose Oxylabs Job Posting Datasets:

Fresh and accurate data: Access clean and structured job posting datasets collected by our seasoned web scraping professionals, enabling you to dive into analysis.

Time and resource savings: Focus on data analysis and your core business objectives while we efficiently handle the data extraction process cost-effectively.

Customized solutions: Tailor our approach to your business needs, ensuring your goals are met.

Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA best practices.

Pricing Options:

Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Effortlessly access fresh job posting data with Oxylabs Job Posting Datasets.
Social Recommendation Data
cseweb.ucsd.edu
berd-platform.de
json
Cite
UCSD CSE Research Project, Social Recommendation Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
Available download formats: json
Dataset authored and provided by
UCSD CSE Research Project
Description
These datasets include ratings as well as social (or trust) relationships between users. Data are from LibraryThing (a book review website) and Epinions (general consumer reviews).

Metadata includes

reviews

price paid (Epinions)

helpfulness votes (LibraryThing)

flags (LibraryThing)
Aerospace Example
catalog.data.gov
s.cnmilf.com
Updated Apr 11, 2025
Cite
Dashlink (2025). Aerospace Example [Dataset]. https://catalog.data.gov/dataset/aerospace-example
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
This is a textbook-style example created for illustration purposes. The system takes Pt, Ps, and Alt as inputs and, if the plane is flying supersonically, calculates the Mach number using the Rayleigh Pitot tube equation. (See Anderson.) The unit calculates Cd given Ma and Alt. For more details, see the NASA TM, also on this website.
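The Mach number calculation described above can be sketched in Python as follows. This is an illustrative sketch, not the dataset's own code. It assumes air with a specific-heat ratio of 1.4, assumes the standard isentropic relation for the subsonic case (the description only covers supersonic flight), and inverts the Rayleigh Pitot tube formula by bisection. The function names are hypothetical, and the Cd calculation is not shown because the description gives no formula for it.

```python
import math

GAMMA = 1.4  # ratio of specific heats for air (assumed, not from the source)


def rayleigh_pitot_ratio(mach, gamma=GAMMA):
    """Pt/Ps read by a Pitot tube behind a normal shock, valid for mach >= 1."""
    m2 = mach * mach
    shock_term = ((gamma + 1) ** 2 * m2) / (4 * gamma * m2 - 2 * (gamma - 1))
    return shock_term ** (gamma / (gamma - 1)) * (1 - gamma + 2 * gamma * m2) / (gamma + 1)


def mach_from_pressures(pt, ps, gamma=GAMMA):
    """Estimate the Mach number from Pitot (total) and static pressure."""
    ratio = pt / ps
    if ratio < 1.0:
        raise ValueError("Pt must not be smaller than Ps")
    # Pressure ratio at exactly Mach 1, where both branches agree.
    sonic_ratio = ((gamma + 1) / 2) ** (gamma / (gamma - 1))
    if ratio <= sonic_ratio:
        # Subsonic branch (assumption): invert the isentropic relation.
        return math.sqrt(2 / (gamma - 1) * (ratio ** ((gamma - 1) / gamma) - 1))
    # Supersonic branch: the Rayleigh ratio grows with Mach number,
    # so bisection on a bracketing interval finds the root.
    lo, hi = 1.0, 50.0
    if ratio > rayleigh_pitot_ratio(hi, gamma):
        raise ValueError("pressure ratio is outside the supported Mach range")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if rayleigh_pitot_ratio(mid, gamma) < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, a pressure ratio Pt/Ps of about 5.64 corresponds to Mach 2 under these assumptions, which agrees with standard normal-shock tables for air.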
Data from: Low-Temperature Geothermal Geospatial Datasets: An Example from...
gdr.openei.org
data.openei.org
Updated Feb 6, 2023
Cite
Estefanny Davalos Elizondo; Amanda Kolker; Ian Warren (2023). Low-Temperature Geothermal Geospatial Datasets: An Example from Alaska [Dataset]. http://doi.org/10.15121/1997233
Explore at:
Unique identifier
https://doi.org/10.15121/1997233
Dataset updated
Feb 6, 2023
Dataset provided by
Office of Energy Efficiency and Renewable Energy, http://energy.gov/eere
Geothermal Data Repository
National Renewable Energy Laboratory
Authors
Estefanny Davalos Elizondo; Amanda Kolker; Ian Warren
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Alaska
Description
This project is a component of a broader effort focused on geothermal heating and cooling (GHC) with the aim of illustrating the numerous benefits of incorporating GHC and geothermal heat exchange (GHX) into community energy planning and national decarbonization strategies. To better assist private sector investment, it is currently necessary to define and assess the potential of low-temperature geothermal resources. For shallow GHC/GHX fields, there is no formal compilation of subsurface characteristics shared among industry practitioners that can improve system design and operations. Alaska is specifically noted in this work, because heretofore, it has not received a similar focus in geothermal potential evaluations as the contiguous United States. The methodology consists of leveraging relevant data to generate a baseline geospatial dataset of low-temperature resources (less than 150 degrees C) to compare and analyze information accessible to anyone trying to understand the potential of GHC/GHX and small-scale low-temperature geothermal power in Alaska (e.g., energy modelers, communities, planners, and policymakers). Importantly, this project identifies data related to (1) the evaluation of GHC/GHX in the shallow subsurface, and (2) the evaluation of low-temperature geothermal resource availability. Additionally, data is being compiled to assess repurposing of oil and gas wells to contribute co-produced fluids toward the geothermal direct use and heating and cooling resource potential. In this work we identified new data from three different datasets of isolated geothermal systems in Alaska and bottom-hole temperature data from oil and gas wells that can be leveraged for evaluation of low-temperature geothermal resource potential. The goal of this project is to facilitate future deployment of GHC/GHX analysis and community-led programs and update the low-temperature geothermal resources assessment of Alaska. 
A better understanding of shallow potential for GHX will improve design and operations of highly efficient GHC systems. The deployment and impact that can be achieved for low-temperature geothermal resources will contribute to decarbonization goals and facilitate widespread electrification by shaving and shifting grid loads.

Most of the data use the WGS84 coordinate system. However, the datasets come from different sources, and each has a metadata file that records its original coordinate system.

