Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 22 data sets of 50+ requirements each, expressed as user stories.
The dataset has been created by gathering data from web sources and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies]
The data sets were originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light
This collection was originally published on Mendeley Data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.
g02-federalspending.txt
(2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, where DAIMS stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.
g03-loudoun.txt
(2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.
g04-recycling.txt
(2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through a map visualization that the user can interact with. The dataset was obtained from a GitHub website and it is the basis of a students' project on web site design; the code is available (no license).
g05-openspending.txt
(2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation that aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.
g11-nsf.txt
(2018) is a collection of user stories from the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.
g08-frictionless.txt
(2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.
g14-datahub.txt
(2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.
g16-mis.txt
(2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.
g17-cask.txt
(2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.
g18-neurohub.txt
(2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.
g22-rdadmp.txt
(2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.
g23-archivesspace.txt
(2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its
As of early 2020, around 59 percent of surveyed respondents in China were aware that many online shopping and e-commerce mobile apps overused user permissions. Social media and messenger apps were the second app category with low user trust in data security.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.
This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.
The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.
WARNING: The feature values are plausible, but still totally artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.
Figure: Distribution of Login Attempts Included in the Synthesized Data Set (https://github.com/das-group/rba-dataset/raw/main/images/login-overview.png)
The data set contains the following features related to each login attempt on the SSO:
Feature | Data Type | Description | Range or Example |
---|---|---|---|
IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255 |
Country | String | Country derived from the IP address | US |
Region | String | Region derived from the IP address | New York |
City | String | City derived from the IP address | Rochester |
ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000 |
User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ... |
OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10 |
Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538 |
Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown)¹ |
User ID | Integer | Identification number related to the affected user account | [Random pseudonym] |
Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp] |
Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000 |
Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false) |
Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false) |
Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false) |
As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining logical relations and temporal order between the features.
The timestamps, however, are not identical and contain randomness. The feature values related to IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.
The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.
The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical to those in the real data set.
The RTT was randomly drawn based on the login success status and the synthesized geolocation data. We did this to ensure that the RTTs are realistic.
Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.
You can recognize them by the following values:
ASNs with values >= 500,000
IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
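The two rules above can be turned into a simple filter. The following is a minimal sketch, assuming each login attempt is available as an IP address string and an integer ASN; the function name `is_placeholder` and the sample rows are hypothetical and not taken from the data set.

```python
import ipaddress

# Thresholds taken from the two rules listed above.
PLACEHOLDER_ASN_MIN = 500_000
PLACEHOLDER_NET = ipaddress.ip_network("10.0.0.0/8")


def is_placeholder(ip: str, asn: int) -> bool:
    """Return True if the ASN or the IP address is one of the unrealistic values."""
    return asn >= PLACEHOLDER_ASN_MIN or ipaddress.ip_address(ip) in PLACEHOLDER_NET


# Invented example rows, for illustration only.
rows = [
    {"ip": "10.4.1.200", "asn": 29695},
    {"ip": "84.35.213.209", "asn": 500123},
    {"ip": "170.39.78.152", "asn": 393398},
]
flags = [is_placeholder(r["ip"], r["asn"]) for r in rows]
print(flags)
```

Whether such rows should be dropped or kept depends on the study; the description above states that they do not affect the Freeman et al. (2016) risk scores.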
Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.
The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.
See RESULTS.md for more details.
Figure: Median RTTs by Country (https://github.com/das-group/rba-dataset/raw/main/images/rtts-continents.png)
By using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and fostering RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.
The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.
You can find more details on our conducted study in the following journal article:
Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022) Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono. ACM Transactions on Privacy and Security
@article{Wiefling_Pump_2022,
author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
journal = {{ACM} {Transactions} on {Privacy} and {Security}},
doi = {10.1145/3546069},
publisher = {ACM},
year = {2022}
}
This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069
¹ A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
This statistic illustrates the average mobile data consumption of Netflix users in Italy from January to May 2018. According to data tracked by Walletsaver over the period of consideration, the average mobile data consumption of mobile users for Netflix went from 12 megabytes (MB) in January 2018 to 22 MB in May 2018.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Internet use in the UK annual estimates by age, sex, disability, ethnic group, economic activity and geographical location, including confidence intervals.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the relation between users, instruments, elapsed time, result and status_code.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This guide will introduce the open data resources available in the CA Nature website and familiarize you with key features and capabilities of the site.
CA Nature is an online Geographic Information System (GIS) that brings together a suite of publicly accessible interactive digital mapping tools and data.
Beginning March 1, 2022, the "COVID-19 Case Surveillance Public Use Data" will be updated on a monthly basis. This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.
CDC has three COVID-19 case surveillance datasets:
COVID-19 Case Surveillance Public Use Data with Geography: Public use, patient-level dataset with clinical data (including symptoms), demographics, and county and state of residence. (19 data elements)
COVID-19 Case Surveillance Public Use Data: Public use, patient-level dataset with clinical and symptom data and demographics, with no geographic data. (12 data elements)
COVID-19 Case Surveillance Restricted Access Detailed Data: Restricted access, patient-level dataset with clinical and symptom data, demographics, and state and county of residence. Access requires a registration process and a data use agreement. (32 data elements)
The following apply to all three datasets:
Data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
Data are considered provisional by CDC and are subject to change until the data are reconciled and verified with the state and territorial data providers.
Some data cells are suppressed to protect individual privacy.
The datasets will include all cases with the earliest date available in each record (date received by CDC or date related to illness/specimen collection) at least 14 days prior to the creation of the previously updated datasets. This 14-day lag allows case reporting to be stabilized and ensures that time-dependent outcome data are accurately captured.
Datasets are updated monthly.
Datasets are created using CDC’s operational Policy on Public Health Research and Nonresearch Data Management and Access and include protections designed to protect individual privacy.
For more information about data collection and reporting, please see https://wwwn.cdc.gov/nndss/data-collection.html
For more information about the COVID-19 case surveillance data, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/faq-surveillance.html
Overview
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020 to clarify the interpretation of antigen detection tests and serologic test results within the case classification. The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported volun
Find User Profiles Data with LinkedIn profiles for nonprofit and NGO executives, managers, and administrators worldwide. Includes verified contact details, organizational affiliations, and professional histories. Best price guaranteed.
Success.ai’s User Profiles Data for Nonprofit and NGO Leaders provides businesses, organizations, and researchers with comprehensive access to global leaders in the nonprofit and NGO sectors. With data sourced from over 700 million verified LinkedIn profiles, this dataset includes actionable insights and contact details for executives, program managers, administrators, and decision-makers. Whether your goal is to partner with nonprofits, support global causes, or conduct research into social impact, Success.ai ensures your outreach is backed by accurate, enriched, and continuously updated data.
Why Choose Success.ai’s User Profiles Data for Nonprofit and NGO Leaders?
Comprehensive Professional Profiles
Access verified LinkedIn profiles of nonprofit leaders, NGO managers, program directors, grant writers, and administrative executives. AI-driven validation ensures 99% accuracy for efficient communication and minimized bounce rates.
Global Coverage Across Nonprofit Sectors
Includes profiles from nonprofits, humanitarian organizations, environmental groups, social enterprises, and advocacy organizations. Covers key markets across North America, Europe, APAC, South America, and Africa for global reach.
Continuously Updated Dataset
Reflects real-time professional updates, organizational changes, and emerging trends in the nonprofit landscape to keep your targeting relevant and effective.
Tailored for Nonprofit Insights
Enriched profiles include work histories, organizational affiliations, areas of expertise, and social impact projects for deeper engagement opportunities.
Data Highlights:
700M+ Verified LinkedIn Profiles: Access a vast network of nonprofit and NGO professionals worldwide.
100M+ Work Emails: Direct communication with executives, managers, and decision-makers in the nonprofit sector.
Enriched Organizational Data: Gain insights into leadership structures, mission focuses, and operational scales.
Industry-Specific Segmentation: Target nonprofits focused on healthcare, education, environmental sustainability, human rights, and more.
Key Features of the Dataset:
Nonprofit and NGO Leader Profiles
Identify and connect with executives, program managers, fundraisers, and policy directors in global nonprofit and NGO sectors. Engage with individuals who drive decision-making and operational strategies for impactful organizations.
Detailed Organizational Insights
Leverage firmographic data, including organizational size, mission, regional activity, and funding sources, to align with specific nonprofit goals.
Advanced Filters for Precision Targeting
Refine searches by region, mission type, role, or organizational focus for tailored outreach. Customize campaigns based on social impact priorities, such as climate action, gender equality, or economic development.
AI-Driven Enrichment
Enhanced datasets provide actionable insights into professional accomplishments, partnerships, and leadership achievements for targeted engagement.
Strategic Use Cases:
Partnership Development and Outreach
Identify nonprofits and NGOs for collaboration on social impact projects, sponsorships, or grant distribution. Build relationships with decision-makers driving advocacy, fundraising, and community initiatives.
Donor Engagement and Fundraising
Target nonprofit leaders responsible for managing fundraising campaigns and donor relationships. Tailor outreach efforts to align with specific causes and funding priorities.
Research and Analysis
Analyze leadership trends, mission focuses, and regional nonprofit activities to inform program design and funding strategies. Use insights to evaluate the effectiveness of social impact initiatives and partnerships.
Recruitment and Talent Acquisition
Target HR professionals and administrators seeking qualified staff, consultants, or volunteers for nonprofits and NGOs. Offer talent solutions for specialized roles in program management, advocacy, and administration.
Why Choose Success.ai?
Best Price Guarantee
Access industry-leading, verified User Profiles Data at unmatched pricing to ensure your campaigns are cost-effective and impactful.
Seamless Integration
Easily integrate verified nonprofit data into your CRM or marketing platforms with APIs or downloadable formats.
AI-Validated Accuracy
Rely on 99% accuracy to minimize wasted outreach efforts and maximize engagement outcomes.
Customizable Solutions
Tailor datasets to focus on specific nonprofit types, geographical regions, or areas of social impact to meet your strategic objectives.
Strategic APIs for Enhanced Campaigns:
Data Enrichment API
Update your internal records with verified nonprofit leader profiles to enhance targeting and engagement.
Lead Generation API
Automate lead generation for a consistent pipeline of nonprofit and NGO professionals, scaling your outreach efforts efficiently.
Success.ai’s User Profiles Data for Nonprofit and NGO Leader...
## Overview
Ppe User is a dataset for object detection tasks - it contains User annotations for 18,530 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Success.ai presents an unmatched opportunity with its User Profiles Data, offering in-depth access to LinkedIn profiles and company data that empowers businesses to develop ideal customer profiles, enrich company data, and sharpen competitive intelligence. Our LinkedIn Data Solutions are crafted to support your B2B strategies, providing a foundation for sales data enrichment and strategic market positioning.
Key Use Cases:
Why Success.ai is the Preferred Choice:
By choosing Success.ai, you gain access to a wealth of LinkedIn and user profile data that will enhance your market understanding, enrich customer interactions, and enable effective competitive strategies. Our extensive databases are the cornerstone of successful B2B engagements and strategic business planning.
Get Started with Success.ai Now: Explore the potential of detailed LinkedIn data in your business strategy. Reach out to us for a consultation or start integrating our tailored data solutions today.
And no one beats us on price. Period.
Johnyquest7/OctoTools-Gradio-Demo-User-Data dataset hosted on Hugging Face and contributed by the HF Datasets community
As of January 2025, around 13.7 percent of paid iOS apps admitted collecting data from users engaging with their mobile products. In comparison, approximately 53 percent of free-to-download iOS apps reported they collect private data from users worldwide, while approximately 86 percent of paid apps have not declared whether they collect users' private data.
This page pulls together resources for various types of data.wa.gov users, including developers, publishers and data users.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was derived from the full harvest of Trove public tags. It contains a list of unique tags and the total number of resources in Trove each tag is attached to. It is formatted as a CSV file with the following columns:
tag – the tag string
count – number of resources the tag has been applied to
User content added to Trove, including tags, is available for reuse under a CC-BY-NC-SA licence.
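A minimal sketch of reading the tag counts, assuming the CSV starts with a header row that uses the two column names listed above. The sample rows and the helper name `top_tags` are invented for illustration; in practice an opened file handle would replace the in-memory sample.

```python
import csv
import io

# Hypothetical sample in the documented two-column layout (tag, count).
sample = io.StringIO("tag,count\nww1,1200\nfamily history,850\nshipwreck,40\n")


def top_tags(handle, n=2):
    """Return the n most frequently applied tags as (tag, count) pairs."""
    reader = csv.DictReader(handle)
    rows = [(row["tag"], int(row["count"])) for row in reader]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


top = top_tags(sample)
print(top)
```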
The Landsat 7 Data Users Handbook is prepared by the U.S. Geological Survey (USGS) Landsat Project Science Office at the Earth Resources Observation and Science (EROS) Center in Sioux Falls, SD, and the National Aeronautics and Space Administration (NASA) Landsat Project Science Office at NASA’s Goddard Space Flight Center (GSFC) in Greenbelt, Maryland. The purpose of this handbook is to provide a basic understanding and associated reference material for the Landsat 7 Observatory and its science data products. In doing so, this document does not include a detailed description of all technical details of the Landsat 7 mission but focuses on the information needed by the users to gain an understanding of the science data products. This handbook includes various sections that provide an overview of reference material and a more detailed description of applicable data user and product information.
This document includes the following sections:
• Section 1 describes the background for the Landsat 7 mission, as well as previous Landsat missions
• Section 2 provides a comprehensive overview of the Landsat 7 concept of operations, the Observatory, including the spacecraft, the Enhanced Thematic Mapper Plus (ETM+) instrument, the Landsat 7 ground system, as well as various institutional services
• Section 3 provides information on radiometric and geometric instrument calibration, as well as a description of the Calibration Parameter File (CPF)
• Section 4 includes information about the Landsat 7 Long-Term Acquisition Plan (LTAP) and documents the changes in data acquisition scheduling since launch
• Section 5 includes a description of Level 1 data products and product generation, as well as conversion of Digital Numbers (DNs) to physical units
• Section 6 provides an overview of data search and access using various online tools
• Appendix A contains a list of known issues associated with Landsat 7 data
• Appendix B includes information about the CPF content
• Appendix C contains historical information pertaining to Landsat 7 ETM+ and Landsat 5 Thematic Mapper (TM) cross-calibration
• Appendix D includes historical information about Level 0 Reformatted (L0R) and Level 1 Reformatted (L1R) data products
• Appendix E details the differences between Level 1 Product Generation System (LPGS) and National Land Archive Production System (NLAPS) products
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Active Internet Users: % of Population: NC: Republic of Dagestan data was reported at 95.600 % in 2023. This records an increase from the previous number of 91.000 % for 2022. Active Internet Users: % of Population: NC: Republic of Dagestan data is updated yearly, averaging 79.700 % (median) from Dec 2014 to 2023, with 10 observations. The data reached an all-time high of 95.600 % in 2023 and a record low of 46.300 % in 2014. Active Internet Users: % of Population: NC: Republic of Dagestan data remains in active status in CEIC and is reported by Federal State Statistics Service. The data is categorized under Russia Premium Database’s Transport and Telecommunications Sector – Table RU.TH001: Share of Active Internet Users: by Region.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from four independent testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by Landauer et al. (2020) [1]. Please refer to the paper for more detailed information on automatic testbed generation and cite it if the data is used for academic publications. In brief, each testbed simulates user accesses to a webserver that runs Horde Webmail and OkayCMS. The duration of the simulation is six days. On the fifth day (2020-03-04) two attacks are launched against each web server.
The archive AIT-LDS-v1_0.zip contains the directories "data" and "labels".
The data directory is structured as follows. Each directory mail.
Setup details of the web servers:
OS: Debian Stretch 9.11.6
Services:
Apache2
PHP7
Exim 4.89
Horde 5.2.22
OkayCMS 2.3.4
Suricata
ClamAV
MariaDB
Setup details of user machines:
OS: Ubuntu Bionic
Services:
Chromium
Firefox
User host machines are assigned to web servers in the following way:
mail.cup.com is accessed by users from host machines user-{0, 1, 2, 6}
mail.spiral.com is accessed by users from host machines user-{3, 5, 8}
mail.insect.com is accessed by users from host machines user-{4, 9}
mail.onion.com is accessed by users from host machines user-{7, 10}
The following attacks are launched against the web servers. Starting times differ for each web server; please check the labels for exact attack times:
Attack 1: multi-step attack with sequential execution of the following attacks:
nmap scan
nikto scan
smtp-user-enum tool for account enumeration
hydra brute force login
webshell upload through Horde exploit (CVE-2019-9858)
privilege escalation through Exim exploit (CVE-2019-10149)
Attack 2: webshell injection through malicious cookie (CVE-2019-16885)
Attacks are launched from the following user host machines. In each of the corresponding directories user-
user-6 attacks mail.cup.com
user-5 attacks mail.spiral.com
user-4 attacks mail.insect.com
user-7 attacks mail.onion.com
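The two assignments above (which user hosts access which web server, and which host attacks it) can be written down as plain lookup tables. The following is a minimal sketch, not part of the dataset: the names SERVER_USERS, ATTACKERS, and benign_users are hypothetical and chosen only for illustration. It shows that each attacking host is also one of the hosts assigned to the server it attacks.

```python
# Host IDs assigned to each web server, copied from the listing above.
SERVER_USERS = {
    "mail.cup.com": {0, 1, 2, 6},
    "mail.spiral.com": {3, 5, 8},
    "mail.insect.com": {4, 9},
    "mail.onion.com": {7, 10},
}

# Attacking host ID per web server, copied from the listing above.
ATTACKERS = {
    "mail.cup.com": 6,
    "mail.spiral.com": 5,
    "mail.insect.com": 4,
    "mail.onion.com": 7,
}


def benign_users(server):
    """Return the host IDs assigned to `server` that do not launch attacks."""
    return SERVER_USERS[server] - {ATTACKERS[server]}


def attackers_are_assigned_users():
    """Check that every attacking host is among its server's assigned hosts."""
    return all(ATTACKERS[s] in SERVER_USERS[s] for s in SERVER_USERS)
```

For example, benign_users("mail.cup.com") yields the hosts 0, 1, and 2, which may be useful when separating normal user activity from attacker activity.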
The log data collected from the web servers includes
Apache access and error logs
syscall logs collected with the Linux audit daemon
suricata logs
exim logs
auth logs
daemon logs
mail logs
syslogs
user logs
Note that, due to their large size, the audit/audit.log files of each server were compressed into a .zip archive. If these logs are needed for analysis, they must first be unzipped.
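Unzipping the audit logs can be scripted. The sketch below rests on assumptions that the description does not confirm: that each archive sits inside the server's audit directory and has a name ending in .zip. The function name extract_audit_archives and the data_root argument are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_audit_archives(data_root):
    """Extract every .zip archive found in an 'audit' directory below data_root.

    Assumption: the archives live next to where audit.log is expected.
    Returns the paths of all extracted members.
    """
    extracted = []
    for archive in sorted(Path(data_root).rglob("audit/*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Unpack next to the archive so the audit/audit.log layout is kept.
            zf.extractall(archive.parent)
            extracted.extend(archive.parent / name for name in zf.namelist())
    return extracted
```

If the archives are stored elsewhere in the extracted data set, the glob pattern needs to be adjusted accordingly.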
Labels are organized in the same directory structure as logs. Each file contains two labels for each log line separated by a comma, the first one based on the occurrence time, the second one based on similarity and ordering. Note that this does not guarantee correct labeling for all lines and that no manual corrections were conducted.
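Pairing log lines with their labels can be sketched as follows. This assumes that line i of a label file belongs to line i of the log file at the same relative path, and that each label line has the form "first,second"; both points are a reading of the description above, not a verified file format. The function name read_labeled_lines is hypothetical.

```python
def read_labeled_lines(log_path, label_path):
    """Yield (log_line, time_label, similarity_label) triples.

    Assumption: both files have the same number of lines and are aligned
    line by line; each label line holds two comma-separated labels.
    """
    with open(log_path, encoding="utf-8", errors="replace") as logs, \
            open(label_path, encoding="utf-8") as labels:
        for log_line, label_line in zip(logs, labels):
            # First label: based on occurrence time.
            # Second label: based on similarity and ordering.
            time_label, similarity_label = label_line.rstrip("\r\n").split(",", 1)
            yield log_line.rstrip("\r\n"), time_label, similarity_label
```

Since the labels were not manually corrected, lines on which the two labels disagree may be worth inspecting before they are used for evaluation.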
Version history and related data sets:
AIT-LDS-v1.0: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
AIT-LDS-v1.1: Removed carriage return of line endings in audit.log files.
AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).
If you use the dataset, please cite the following publication:
[1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
China Internet User Structure: Mobile Phone: Student data was reported at 34.300 % in Jun 2011. This records a decrease from the previous number of 35.600 % for Dec 2010. The data is updated semiannually, with a median of 34.300 % from Dec 2009 to Jun 2011 across 3 observations. The data reached an all-time high of 35.600 % in Dec 2010 and a record low of 32.300 % in Dec 2009. The series remains active in CEIC and is reported by the China Internet Network Information Center. The data is categorized under China Premium Database’s Information and Communication Sector – Table CN.ICE: Internet: Internet User Structure: Mobile Phone.
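The summary figures in this description can be checked with a few lines of arithmetic. The sketch below assumes that the three values quoted in the text (Dec 2009, Dec 2010, Jun 2011) are exactly the three observations of the series; the variable names are hypothetical.

```python
from statistics import median

# Values quoted in the description, in percent (assumed to be the full series).
observations = {"Dec 2009": 32.3, "Dec 2010": 35.6, "Jun 2011": 34.3}
values = list(observations.values())

summary = {
    "median": median(values),
    "high": max(values),
    "low": min(values),
    # Change from the previous observation to the latest, in percentage points.
    "change": round(observations["Jun 2011"] - observations["Dec 2010"], 1),
}
```

Under that assumption, the median equals the latest value of 34.3 %, and the latest change is a decrease of 1.3 percentage points.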
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 22 data sets, each containing 50+ requirements expressed as user stories.
The dataset was created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].
The data sets were originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light
This collection was originally published on Mendeley Data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.
g02-federalspending.txt
(2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share spending data for the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, where DAIMS stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.
g03-loudoun.txt
(2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here and is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.
g04-recycling.txt
(2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through a map visualization that the user can interact with. The dataset was obtained from a GitHub website and is the basis of a student project on website design; the code is available (no license).
g05-openspending.txt
(2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation that aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.
g11-nsf.txt
(2018) is a collection of user stories from the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.
g08-frictionless.txt
(2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.
g14-datahub.txt
(2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.
g16-mis.txt
(2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.
g17-cask.txt
(2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.
g18-neurohub.txt
(2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.
g22-rdadmp.txt
(2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.
g23-archivesspace.txt
(2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its