100+ datasets found
  1. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Cite
    Fabiano Dalpiaz; Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Ltd.
    Authors
    Fabiano Dalpiaz; Fabiano Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 22 data sets of 50+ requirements each, expressed as user stories.

    The dataset was created by gathering data from web sources; we are not aware of license agreements or intellectual property rights on the requirements/user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removing that dataset [see Zenodo's policies].

    The data sets were originally used to conduct experiments on ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection was originally published in Mendeley Data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

    Overview of the datasets [data and links added in December 2024]

    The following text describes the datasets, including links to the systems and websites where available. The datasets are organized by macro-category and then by identifier.
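    Since the files share the user-story format, a few lines of Python suffice to split each requirement into its parts. This is a minimal sketch: the canonical "As a <role>, I want <goal>, so that <benefit>" template and the sample file name are assumptions, as individual datasets may deviate from the template.

    ```python
    import re
    from pathlib import Path

    # Canonical user-story template; individual files may deviate from it.
    STORY_RE = re.compile(
        r"^As an?\s+(?P<role>.+?),\s*I want\s+(?P<goal>.+?)"
        r"(?:,?\s*so that\s+(?P<benefit>.+?))?\.?$",
        re.IGNORECASE,
    )

    def parse_stories(path):
        """Yield (role, goal, benefit) tuples from one requirements file."""
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            match = STORY_RE.match(line.strip())
            if match:
                yield match.group("role"), match.group("goal"), match.group("benefit")

    # File name taken from the overview below.
    for role, goal, benefit in parse_stories("g02-federalspending.txt"):
        print(f"role={role!r} goal={goal!r} benefit={benefit!r}")
    ```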

    Public administration and transparency

    g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created as a result of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS (DATA Act Information Model Schema), also known as Data Broker. The sample that was gathered refers to a sub-project that allows the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here; it is part of the Electronic Land Management System and EPlan Review Project RFP/RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

    g04-recycling.txt (2017) concerns a web application for searching and locating recycling and waste disposal facilities. The application operates through an interactive map that the user can explore. The dataset was obtained from a GitHub repository and is the basis of a students' project on website design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation that aims at transparency in how local governments spend money. At the time of collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and on how the data should be presented. Currently, OpenSpending is managed via a GitHub repository that contains multiple sub-projects with unknown licenses.

    g11-nsf.txt (2018) is a collection of user stories from the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers open source tooling for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT licenses) and on the web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform that has been developed over multiple years. The specific data set is an initial set of user stories, which can be dated back to 2013 thanks to a comment therein.

    g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello board. Although the user stories do not have explicit links to projects, it can be inferred that they originate from a project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework for distributed processing of large datasets. The user stories are extracted from a document with requirements regarding dataset management for Cask 4.0, which includes the scenarios, the user stories, and a design for their implementation. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

    g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories created by asking the community to suggest functionality that should be part of a website for managing data management plans. Each user story is stored as an issue in the GitHub repository.

    g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its

  2. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE
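    A quick way to work with the labels is to load NICHE.csv with pandas. The sketch below is illustrative only: the label column name is an assumption, so inspect the actual header first.

    ```python
    import pandas as pd

    # NICHE.csv ships with the repository linked above.
    df = pd.read_csv("NICHE.csv")
    print(df.columns.tolist())   # inspect the real schema before relying on it

    # Hypothetical column name; the description promises engineered /
    # non-engineered labels (441 vs. 131 projects).
    label_col = "label"
    if label_col in df.columns:
        print(df[label_col].value_counts())
    ```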

  3. Walmart Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Mar 27, 2025
    Cite
    Bright Data (2022). Walmart Datasets [Dataset]. https://brightdata.com/products/datasets/walmart
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Mar 27, 2025
    Dataset authored and provided by
    Bright Data
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Use our constantly updated Walmart products dataset to get a complete snapshot of new products, categories, pricing, and consumer reviews. You may purchase the entire dataset or a customized subset, depending on your needs. Popular use cases: identify product inventory gaps and increased demand for certain products, analyze consumer sentiment, and define a pricing strategy by locating similar products and categories among your competitors. The dataset includes all major data points: product, SKU, GTIN, currency, timestamp, price, and more. Get your Walmart dataset today!

  4. Kaggles For Traffic Dataset

    • universe.roboflow.com
    zip
    Updated Dec 25, 2023
    Cite
    school (2023). Kaggles For Traffic Dataset [Dataset]. https://universe.roboflow.com/school-0ljld/kaggle-datasets-for-traffic
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 25, 2023
    Dataset authored and provided by
    school
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Traffic Sign Bounding Boxes
    Description

    Kaggle Datasets For Traffic

    ## Overview

    Kaggle Datasets For Traffic is a dataset for object detection tasks - it contains Traffic Sign annotations for 8,122 images.

    ## Getting Started

    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
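    Datasets hosted on Roboflow Universe can typically also be pulled programmatically with the `roboflow` Python package. In this sketch the workspace and project slugs come from the citation URL above, while the version number and export format are assumptions to check against the project page.

    ```python
    from roboflow import Roboflow  # pip install roboflow

    rf = Roboflow(api_key="YOUR_API_KEY")
    # Slugs taken from the dataset URL in the citation above.
    project = rf.workspace("school-0ljld").project("kaggle-datasets-for-traffic")
    # Version number and export format are assumptions; pick what the page offers.
    dataset = project.version(1).download("yolov8")
    print(dataset.location)  # local folder with images and annotations
    ```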
    
  5. Public Dataset Access and Usage

    • data.sfgov.org
    application/rdfxml +5
    Updated Mar 26, 2025
    Cite
    Public Dataset Access and Usage [Dataset]. https://data.sfgov.org/City-Infrastructure/Public-Dataset-Access-and-Usage/su99-qvi4
    Explore at:
    Available download formats: csv, application/rssxml, json, tsv, application/rdfxml, xml
    Dataset updated
    Mar 26, 2025
    Description

    A. SUMMARY This dataset is used to report on public dataset access and usage within the open data portal. Each row sums the number of users who access a dataset each day, grouped by access type (API Read, Download, Page View, etc.).

    B. HOW THE DATASET IS CREATED This dataset is created by joining two internal analytics datasets generated by the SF Open Data Portal. We remove non-public information during the process.

    C. UPDATE PROCESS This dataset is scheduled to update every 7 days via ETL.

    D. HOW TO USE THIS DATASET This dataset can help you identify stale datasets, highlight the most popular datasets and calculate other metrics around the performance and usage in the open data portal.

    Please note a special call-out for two fields:

    - "derived": This field shows whether an asset is an original source (derived = "False") or is made from another asset through filtering (derived = "True"). Essentially, whether or not it is derived from another source.
    - "provenance": This field shows whether an asset is "official" (created by someone in the City of San Francisco) or "community" (created by a member of the community, not official). All community assets are derived, as members of the community cannot add data to the open data portal.
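    Because data.sfgov.org is a Socrata portal, the dataset should also be reachable through the standard SODA resource endpoint derived from the 4x4 ID in the citation URL (su99-qvi4). A sketch, assuming the conventional /resource/ path and the two field names quoted above:

    ```python
    import requests

    # Standard Socrata SODA pattern; the 4x4 ID comes from the dataset URL.
    URL = "https://data.sfgov.org/resource/su99-qvi4.json"

    # $limit is a standard SODA query parameter.
    rows = requests.get(URL, params={"$limit": 100}, timeout=30).json()

    # "provenance" and "derived" are the fields described above; any other
    # field names should be checked against the returned records.
    official = [r for r in rows if r.get("provenance") == "official"]
    print(f"{len(rows)} rows fetched, {len(official)} from official assets")
    ```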

  6. Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Apr 28, 2020
    Cite
    Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems [Dataset]. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ARKOTI
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Jamie Monogan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.

  7. ARPA-E PERFORM datasets | gimi9.com

    • gimi9.com
    Updated Jul 25, 2023
    Cite
    (2023). ARPA-E PERFORM datasets | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_arpa-e-perform-datasets
    Explore at:
    Dataset updated
    Jul 25, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Time-coincident load, wind, and solar data, including actual and probabilistic forecast datasets at 5-min resolution for ERCOT, MISO, NYISO, and SPP. Wind and solar profiles are supplied for existing sites as well as planned sites based on interconnection queue projects as of 2021. For ERCOT, actuals are provided for 2017 and 2018 and forecasts for 2018; for the remaining ISOs, actuals are provided for 2018 and 2019 and forecasts for 2019. These datasets were produced by NREL as part of ARPA-E PERFORM, an ARPA-E funded program that uses time-coincident power and load data and seeks to develop innovative management systems that represent the relative delivery risk of each asset and balance the collective risk of all assets across the grid. For more information on the datasets and the methods used to generate them, see https://github.com/PERFORM-Forecasts/documentation.
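    As a sketch of the kind of forecast-evaluation join these files support: the file and column names below are hypothetical, so consult the documentation repository above for the real layout of the actuals and forecast files.

    ```python
    import pandas as pd

    # Hypothetical file and column names; see the PERFORM documentation repo.
    actuals = pd.read_csv("ercot_load_actuals_2018.csv",
                          parse_dates=["timestamp"], index_col="timestamp")
    forecast = pd.read_csv("ercot_load_forecast_2018.csv",
                           parse_dates=["timestamp"], index_col="timestamp")

    # Both series are at 5-minute resolution, so a plain index join works.
    joined = actuals.join(forecast, lsuffix="_act", rsuffix="_fcst").dropna()
    mae = (joined["load_act"] - joined["load_fcst"]).abs().mean()
    print(f"MAE over {len(joined)} five-minute intervals: {mae:.2f}")
    ```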

  8. AIT Log Data Set V1.1

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Oct 18, 2023
    Cite
    Landauer, Max (2023). AIT Log Data Set V1.1 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3723082
    Explore at:
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Wurzenberger, Markus
    Landauer, Max
    Rauber, Andreas
    Skopik, Florian
    Hotwagner, Wolfgang
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AIT Log Data Sets

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from four independent testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by Landauer et al. (2020) [1]. Please refer to the paper for more detailed information on automatic testbed generation and cite it if the data is used for academic publications. In brief, each testbed simulates user accesses to a webserver that runs Horde Webmail and OkayCMS. The duration of the simulation is six days. On the fifth day (2020-03-04) two attacks are launched against each web server.

    The archive AIT-LDS-v1_0.zip contains the directories "data" and "labels".

    The data directory is structured as follows: each directory corresponds to one of the web servers (mail.cup.com, mail.spiral.com, mail.insect.com, mail.onion.com) and contains the log data collected from that server.

    Setup details of the web servers:

    OS: Debian Stretch 9.11.6

    Services:

    Apache2

    PHP7

    Exim 4.89

    Horde 5.2.22

    OkayCMS 2.3.4

    Suricata

    ClamAV

    MariaDB

    Setup details of user machines:

    OS: Ubuntu Bionic

    Services:

    Chromium

    Firefox

    User host machines are assigned to web servers in the following way:

    mail.cup.com is accessed by users from host machines user-{0, 1, 2, 6}

    mail.spiral.com is accessed by users from host machines user-{3, 5, 8}

    mail.insect.com is accessed by users from host machines user-{4, 9}

    mail.onion.com is accessed by users from host machines user-{7, 10}

    The following attacks are launched against the web servers (different starting times for each web server, please check the labels for exact attack times):

    Attack 1: multi-step attack with sequential execution of the following attacks:

    nmap scan

    nikto scan

    smtp-user-enum tool for account enumeration

    hydra brute force login

    webshell upload through Horde exploit (CVE-2019-9858)

    privilege escalation through Exim exploit (CVE-2019-10149)

    Attack 2: webshell injection through malicious cookie (CVE-2019-16885)

    Attacks are launched from the following user host machines. In each of the corresponding directories user-

    user-6 attacks mail.cup.com

    user-5 attacks mail.spiral.com

    user-4 attacks mail.insect.com

    user-7 attacks mail.onion.com

    The log data collected from the web servers includes

    Apache access and error logs

    syscall logs collected with the Linux audit daemon

    suricata logs

    exim logs

    auth logs

    daemon logs

    mail logs

    syslogs

    user logs

    Note that due to their large size, the audit/audit.log files of each server were compressed into a .zip archive. If these logs are needed for analysis, they must first be unzipped.

    Labels are organized in the same directory structure as the logs. Each file contains two labels for each log line, separated by a comma: the first is based on the occurrence time, the second on similarity and ordering. Note that this does not guarantee correct labeling for all lines and that no manual corrections were conducted.
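    Since the labels mirror the log directory structure line by line, the two files can be read in lockstep. A minimal sketch with hypothetical paths; the assumption that "0" marks a normal line should be verified against the label files themselves.

    ```python
    from pathlib import Path

    # Hypothetical paths; labels mirror the directory structure of the logs.
    log_file = Path("data/mail.cup.com/apache2/access.log")
    label_file = Path("labels/mail.cup.com/apache2/access.log")

    with log_file.open() as logs, label_file.open() as labels:
        for line, label in zip(logs, labels):
            # Two labels per line: occurrence-time based, then similarity based.
            time_label, sim_label = label.strip().split(",", 1)
            if time_label != "0":   # assumption: "0" denotes a normal line
                print(time_label, sim_label, line.rstrip()[:80])
    ```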

    Version history and related data sets:

    AIT-LDS-v1.0: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.

    AIT-LDS-v1.1: Removed carriage return of line endings in audit.log files.

    AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).

    If you use the dataset, please cite the following publication:

    [1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]

  9. Job Postings Dataset for Labour Market Research and Insights

    • datarade.ai
    Updated Sep 20, 2023
    Cite
    Oxylabs (2023). Job Postings Dataset for Labour Market Research and Insights [Dataset]. https://datarade.ai/data-products/job-postings-dataset-for-labour-market-research-and-insights-oxylabs
    Explore at:
    Available download formats: .json, .xml, .csv, .xls
    Dataset updated
    Sep 20, 2023
    Dataset authored and provided by
    Oxylabs
    Area covered
    British Indian Ocean Territory, Jamaica, Luxembourg, Anguilla, Switzerland, Kyrgyzstan, Sierra Leone, Tajikistan, Zambia, Togo
    Description

    Introducing Job Posting Datasets: Uncover labor market insights!

    Elevate your recruitment strategies, forecast future labor industry trends, and unearth investment opportunities with Job Posting Datasets.

    Job Posting Datasets Source:

    1. Indeed: Access datasets from Indeed, a leading employment website known for its comprehensive job listings.

    2. Glassdoor: Receive ready-to-use employee reviews, salary ranges, and job openings from Glassdoor.

    3. StackShare: Access StackShare datasets to make data-driven technology decisions.

    Job Posting Datasets provide meticulously acquired and parsed data, freeing you to focus on analysis. You'll receive clean, structured, ready-to-use job posting data, including job titles, company names, seniority levels, industries, locations, salaries, and employment types.

    Choose your preferred dataset delivery options for convenience:

    • Receive datasets in various formats, including CSV, JSON, and more.
    • Opt for storage solutions such as AWS S3, Google Cloud Storage, and more.
    • Customize data delivery frequencies, whether one-time or per your agreed schedule.

    Why Choose Oxylabs Job Posting Datasets:

    1. Fresh and accurate data: Access clean and structured job posting datasets collected by our seasoned web scraping professionals, enabling you to dive into analysis.

    2. Time and resource savings: Focus on data analysis and your core business objectives while we efficiently handle the data extraction process cost-effectively.

    3. Customized solutions: Tailor our approach to your business needs, ensuring your goals are met.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA best practices.

    Pricing Options:

    Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Effortlessly access fresh job posting data with Oxylabs Job Posting Datasets.

  10. DR IQA Database V2

    • ieee-dataport.org
    Updated Dec 23, 2022
    Cite
    Shahrukh Athar (2022). DR IQA Database V2 [Dataset]. http://doi.org/10.21227/8r47-gp07
    Explore at:
    Dataset updated
    Dec 23, 2022
    Dataset provided by
    IEEE Dataport
    Authors
    Shahrukh Athar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In practical media distribution systems, visual content usually undergoes multiple stages of quality degradation along the delivery chain, but the pristine source content is rarely available at most quality monitoring points along the chain to serve as a reference for quality assessment. As a result, full-reference (FR) and reduced-reference (RR) image quality assessment (IQA) methods are generally infeasible. Although no-reference (NR) methods are readily applicable, their performance is often not reliable. On the other hand, intermediate references of degraded quality are often available, e.g., at the input of video transcoders, but how to make the best use of them has not been deeply investigated.

    This database is associated with a research project whose main goal is to make one of the first attempts to establish a new IQA paradigm named degraded-reference IQA (DR IQA). We initiate work on DR IQA by restricting ourselves to a two-stage distortion pipeline. Most IQA research projects rely on the availability of appropriate quality-annotated datasets. However, we find that only a few small-scale subject-rated datasets of multiply distorted images exist at the moment. These datasets contain a few hundred images each and include the LIVE Multiply Distorted (LIVE MD), Multiply Distorted IVL (MD IVL), and LIVE Wild Compressed (LIVE WCmp) databases. Such small-scale data is not only insufficient for developing robust machine learning based IQA models, but also not enough for multiple-distortion behavior analysis, i.e., studying how multiple distortions behave in conjunction with each other when impacting visual content simultaneously. Surprisingly, such detailed analysis is lacking even for the case of two simultaneous distortions.

    We address the above-mentioned and other issues in our research project titled Degraded Reference Image Quality Assessment. As part of this project, we address the scarcity of data by constructing two large-scale datasets called DR IQA database Version 1 (V1) and DR IQA database Version 2 (V2). Each of these datasets contains 34 pristine reference (PR) images, 1,122 singly distorted degraded reference (DR) images, and 31,790 multiply distorted final distorted (FD) images, making them the largest datasets constructed in this particular area of IQA to date. These datasets formed the basis of the multiple-distortion behavior analysis and DR IQA model development conducted in the project. We hope that the IQA research community will find them useful. Here we release DR IQA database V2; DR IQA database V1 has been released separately, also on IEEE DataPort.

    If you use this database in your research, please cite the following paper (details about the DR IQA project can also be found therein): S. Athar and Z. Wang, "Degraded Reference Image Quality Assessment," accepted for publication in IEEE Transactions on Image Processing, 2022.
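    The core DR IQA idea, scoring a final distorted image against a degraded rather than pristine reference, can be illustrated with a plain PSNR computation. This is an illustrative stand-in, not the models from the cited paper, and the file names are hypothetical (images are assumed to share dimensions).

    ```python
    import numpy as np
    from PIL import Image

    def psnr(ref, img):
        """Peak signal-to-noise ratio between two same-sized uint8 images."""
        mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

    # Hypothetical file names for one pristine/degraded/final triplet.
    pr = np.asarray(Image.open("PR/img034.png"))
    dr = np.asarray(Image.open("DR/img034_blur.png"))
    fd = np.asarray(Image.open("FD/img034_blur_jpeg.png"))

    # FR IQA would compare against the pristine image; DR IQA only has
    # the singly distorted reference available.
    print("FR-style PSNR (PR vs FD):", psnr(pr, fd))
    print("DR-style PSNR (DR vs FD):", psnr(dr, fd))
    ```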

  11. World Bank: Education Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    World Bank (2019). World Bank: Education Data [Dataset]. https://www.kaggle.com/datasets/theworldbank/world-bank-intl-education
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    World Bank (http://worldbank.org/)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The World Bank is an international financial institution that provides loans to countries of the world for capital projects. The World Bank's stated goal is the reduction of poverty. Source: https://en.wikipedia.org/wiki/World_Bank

    Content

    This dataset combines key education statistics from a variety of sources to provide a look at global literacy, spending, and access.

    For more information, see the World Bank website.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:world_bank_health_population

    http://data.worldbank.org/data-catalog/ed-stats

    https://cloud.google.com/bigquery/public-data/world-bank-education

    Citation: The World Bank: Education Statistics

    Dataset Source: World Bank. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by @till_indeman from Unsplash.

    Inspiration

    Of total government spending, what percentage is spent on education?
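    The acknowledged BigQuery mirror can be queried directly to explore that question. A sketch with the google-cloud-bigquery client; the table and indicator code are assumptions to verify in the BigQuery console before use.

    ```python
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()  # requires GCP credentials

    # Table and indicator code are assumptions; verify in the BigQuery console.
    QUERY = """
    SELECT country_name, year, value
    FROM `bigquery-public-data.world_bank_intl_education.international_education`
    WHERE indicator_code = 'SE.XPD.TOTL.GB.ZS'  -- education share of gov. spending
    ORDER BY year DESC, value DESC
    LIMIT 20
    """
    for row in client.query(QUERY).result():
        print(row.country_name, row.year, row.value)
    ```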

  12. Statistical Regression Methods in Education Teaching Datasets: Longitudinal Study of Young People in England, 2004-2006

    • datacatalogue.cessda.eu
    • beta.ukdataservice.ac.uk
    Updated Nov 28, 2024
    Cite
    Cadwallader, S., University of Warwick; Strand, S., University of Warwick (2024). Statistical Regression Methods in Education Teaching Datasets: Longitudinal Study of Young People in England, 2004-2006 [Dataset]. http://doi.org/10.5255/UKDA-SN-6660-1
    Explore at:
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    Institute of Education
    Authors
    Cadwallader, S., University of Warwick; Strand, S., University of Warwick
    Area covered
    England
    Variables measured
    Individuals, Families/households, National
    Measurement technique
    Compilation or synthesis of existing material
    Description

    Abstract copyright UK Data Service and data collection copyright owner.


    These teaching datasets, comprising a sub-set of a large-scale longitudinal study, the Longitudinal Study of Young People in England (LSYPE), were created as part of the NCRM Developing Statistical Modelling in the Social Sciences: Lancaster-Warwick-Stirling Node Phase 2 project, funded by the Economic and Social Research Council (ESRC). During the project, a website was created to provide a web-based training resource on the use of statistical regression methods in educational research. The content is designed to teach users how to perform a variety of regression analyses using SPSS, starting with foundation material in basic statistics and working through to more complex multiple linear, logistic and ordinal regression models. Along with illustrated modules, the site contains demonstration videos, interactive quizzes, and SPSS exercises and examples that use these LSYPE teaching data. Further information and documentation may be found at the website, Using Statistical Methods in Education Research. Throughout the site modules, users are invited to use the datasets for following the examples or performing exercises. Prospective users of the data will be directed to register an account in order to download the data.

    The full LSYPE study is held at the Archive under SN 5545. The teaching datasets include information drawn from Wave 1 of LSYPE, conducted in 2004, with GCSE results matched from Wave 3 (2006). Further information about the NCRM Node project covering this study may be found on the Developing Statistical Modelling in the Social Sciences ESRC award web page.

    Documentation
    There is currently no discrete documentation available with these teaching datasets; users should consult the website noted above. Documentation covering the main LSYPE study is available with SN 5545.

    For the second edition (July 2011), updated versions of the SPSS data files were deposited to resolve minor anomalies.

    Main Topics:

    The teaching datasets include variables covering LSYPE respondents' educational test results, academic achievement and school life, and demographic/household characteristics including ethnic group, gender, social class and socio-economic status, computer ownership, private education, and mothers' occupational status and educational background.

  13. COVID-19 Case Surveillance Public Use Data

    • catalog.data.gov
    • healthdata.gov
    • +6more
    Updated Mar 3, 2022
    Cite
    Centers for Disease Control and Prevention (2022). COVID-19 Case Surveillance Public Use Data [Dataset]. https://catalog.data.gov/dataset/covid-19-case-surveillance-public-use-data
    Explore at:
    Dataset updated
    Mar 3, 2022
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Description

    Beginning March 1, 2022, the "COVID-19 Case Surveillance Public Use Data" will be updated on a monthly basis. This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, presence of any underlying medical conditions and risk behaviors, and no geographic data.

    CDC has three COVID-19 case surveillance datasets:

    - COVID-19 Case Surveillance Public Use Data with Geography: Public use, patient-level dataset with clinical data (including symptoms), demographics, and county and state of residence. (19 data elements)
    - COVID-19 Case Surveillance Public Use Data: Public use, patient-level dataset with clinical and symptom data and demographics, with no geographic data. (12 data elements)
    - COVID-19 Case Surveillance Restricted Access Detailed Data: Restricted access, patient-level dataset with clinical and symptom data, demographics, and state and county of residence. Access requires a registration process and a data use agreement. (32 data elements)

    The following apply to all three datasets: Data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf. Data are considered provisional by CDC and are subject to change until the data are reconciled and verified with the state and territorial data providers. Some data cells are suppressed to protect individual privacy. The datasets will include all cases with the earliest date available in each record (date received by CDC or date related to illness/specimen collection) at least 14 days prior to the creation of the previously updated datasets. This 14-day lag allows case reporting to be stabilized and ensures that time-dependent outcome data are accurately captured. Datasets are updated monthly. Datasets are created using CDC's operational Policy on Public Health Research and Nonresearch Data Management and Access and include protections designed to protect individual privacy. For more information about data collection and reporting, please see https://wwwn.cdc.gov/nndss/data-collection.html. For more information about the COVID-19 case surveillance data, please see https://www.cdc.gov/coronavirus/2019-ncov/covid-data/faq-surveillance.html.

    Overview: The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as "immediately notifiable, urgent (within 24 hours)" by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020 to clarify the interpretation of antigen detection tests and serologic test results within the case classification. The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
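    Once the CSV export has been downloaded from the catalog link above, basic profiling is straightforward. The local file name and the column names in this sketch are assumptions; check them against the case report form referenced above.

    ```python
    import pandas as pd

    # Hypothetical local file name for the downloaded public-use CSV export.
    df = pd.read_csv("covid19_case_surveillance_public_use.csv",
                     dtype=str, nrows=500_000)   # sample rows; full file is large

    print(df.columns.tolist())   # confirm the 12 data elements
    # Column names below are assumptions; adjust after inspecting the header.
    for col in ("sex", "age_group", "current_status"):
        if col in df.columns:
            print(df[col].value_counts(dropna=False), "\n")
    ```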

  14. Kidney Disease Dataset

    • kaggle.com
    Updated Aug 4, 2019
    Cite
    Akshay Singh (2019). Kidney Disease Dataset [Dataset]. https://www.kaggle.com/datasets/akshayksingh/kidney-disease-dataset/suggestions
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 4, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akshay Singh
    Description

    Context

    I am new to Kaggle. I have uploaded this project since I chose this topic for my final year project, and no platform other than Kaggle would be better for sharing my work.

    Content

    The dataset was collected over a 2-month period in India. It has 400 rows with 25 features like red blood cells, pedal edema, sugar, etc. The aim is to classify whether a patient has chronic kidney disease or not. The classification is based on an attribute named 'classification', which is either 'ckd' (chronic kidney disease) or 'notckd'. I've performed cleaning of the dataset, which includes mapping the text to numbers and some other changes. After the cleaning I've done some EDA (Exploratory Data Analysis), and then I've divided the dataset into training and testing sets and applied the models on them. The classification results were not very satisfying initially. So, instead of dropping the rows with NaN values, I've used a lambda function to replace them with the mode of each column. After that I've divided the dataset again into training and testing sets and applied the models on them. This time the results are better, and we see that random forest and decision trees are the best performers, with an accuracy of 1.0 and 0 misclassifications. The performance of the classification is measured by printing the confusion matrix, classification report, and accuracy.
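    The workflow described above translates almost line for line into pandas and scikit-learn. A sketch with a hypothetical file name; 'classification', 'ckd', and 'notckd' are taken from the description.

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("kidney_disease.csv")   # hypothetical file name

    # Map the target attribute named in the description to 0/1.
    df["classification"] = df["classification"].str.strip().map(
        {"ckd": 1, "notckd": 0})

    # Mode imputation instead of dropping rows with NaN values, as described.
    df = df.apply(lambda col: col.fillna(col.mode()[0]))

    # Naively one-hot encode any remaining text columns for the sketch.
    X = pd.get_dummies(df.drop(columns="classification"))
    y = df["classification"]

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(confusion_matrix(y_te, pred))
    print("accuracy:", accuracy_score(y_te, pred))
    ```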

    Acknowledgements

    The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease

    Inspiration

    I want to understand the approach to data science projects and work on different projects to expand my knowledge.

  15. Urban Growth Projection for Southeast Regional Assessment Project

    • catalog.data.gov
    • datasets.ai
    Updated Jun 15, 2024
    Cite
    Climate Adaptation Science Centers (2024). Urban Growth Projection for Southeast Regional Assessment Project [Dataset]. https://catalog.data.gov/dataset/urban-growth-projection-for-southeast-regional-assessment-project
    Explore at:
    Dataset updated
    Jun 15, 2024
    Dataset provided by
    Climate Adaptation Science Centers
    Description

    This dataset represents the extent of urbanization (for the year indicated) predicted by the model SLEUTH, developed by Dr. Keith C. Clarke at the University of California, Santa Barbara, Department of Geography, and modified by David I. Donato of the United States Geological Survey (USGS) Eastern Geographic Science Center (EGSC). Further model modification and implementation was performed at the Biodiversity and Spatial Information Center at North Carolina State University.

    Purpose: Urban growth probability extents throughout the 21st century for the Southeast Regional Assessment Project, which encompasses the states of Alabama, Florida, Georgia, Kentucky, Mississippi, North Carolina, South Carolina, Tennessee and Virginia and parts of the states of Arkansas, Illinois, Indiana, Louisiana, Maryland, Missouri, Ohio and West Virginia.

    Credit: Southeast Regional Assessment Project; Biodiversity and Spatial Information Center, North Carolina State University, Raleigh, North Carolina 27695, Curtis M. Belyea.

    Use Limitation: This data set is not intended for site-specific analyses. Interpretations derived from its use are suited for regional and planning purposes only. These data are not intended to be used at scales larger than 1:100,000. Acknowledgment of the Biodiversity and Spatial Analysis Center at North Carolina State University is appreciated.

  16. U.S. Geological Survey Gap Analysis Program - Land Cover Data v2.2

    • data.globalchange.gov
    • datadiscoverystudio.org
    • +3more
    Updated Jan 19, 2012
    Cite
    (2012). U.S. Geological Survey Gap Analysis Program- Land Cover Data v2.2 [Dataset]. https://data.globalchange.gov/dataset/usgs-gap-analysis-program-land-cover-data-v2-2167e5
    Explore at:
    Dataset updated
    Jan 19, 2012
    Description

    This dataset combines the work of several different projects to create a seamless data set for the contiguous United States. Data from four regional Gap Analysis Projects and the LANDFIRE project were combined to make this dataset. In the northwestern United States (Idaho, Oregon, Montana, Washington and Wyoming), data in this map came from the Northwest Gap Analysis Project. In the southwestern United States (Colorado, Arizona, Nevada, New Mexico, and Utah), data used in this map came from the Southwest Gap Analysis Project. The data for Alabama, Florida, Georgia, Kentucky, North Carolina, South Carolina, Mississippi, Tennessee, and Virginia came from the Southeast Gap Analysis Project, and the California data was generated by the updated California Gap land cover project. The Hawaii Gap Analysis project provided the data for Hawaii. In areas of the country (central U.S., Northeast, Alaska) that have not yet been covered by a regional Gap Analysis Project, data from the LANDFIRE project was used.

    Similarities in the methods used by these projects made it possible to combine the data they derived into one seamless coverage. They all used multi-season satellite imagery (Landsat ETM+) from 1999-2001 in conjunction with digital elevation model (DEM) derived datasets (e.g. elevation, landform) to model natural and semi-natural vegetation. Vegetation classes were drawn from NatureServe's Ecological System Classification (Comer et al. 2003) or classes developed by the Hawaii Gap project. Additionally, all of the projects included land use classes that were employed to describe areas where natural vegetation has been altered. In many areas of the country these classes were derived from the National Land Cover Dataset (NLCD). For the majority of classes, and in most areas of the country, a decision tree classifier was used to discriminate ecological system types. In some areas of the country, more manual techniques were used to discriminate small patch systems and systems not distinguishable through topography.

    The data contains multiple levels of thematic detail. At the most detailed level, natural vegetation is represented by NatureServe's Ecological System classification (or, in Hawaii, the Hawaii GAP classification). These most detailed classifications have been crosswalked to the five highest levels of the National Vegetation Classification (NVC): Class, Subclass, Formation, Division and Macrogroup. This crosswalk allows users to display and analyze the data at different levels of thematic resolution. Developed areas, or areas dominated by introduced species, timber harvest, or water, are represented by other classes, collectively referred to as land use classes; these land use classes occur at each of the thematic levels.

    Raster data in both ArcGIS Grid and ERDAS Imagine format is available for download at http://gis1.usgs.gov/csas/gap/viewer/land_cover/Map.aspx. Six layer files are included in the download packages to assist the user in displaying the data at each of the thematic levels in ArcGIS. In addition to the raster datasets, the data is available as Web Mapping Services (WMS) for each of the six NVC classification levels (Class, Subclass, Formation, Division, Macrogroup, Ecological System) at the following links:

    http://gis1.usgs.gov/arcgis/rest/services/gap/GAP_Land_Cover_NVC_Class_Landuse/MapServer
    http://gis1.usgs.gov/arcgis/rest/services/gap/GAP_Land_Cover_NVC_Subclass_Landuse/MapServer
    http://gis1.usgs.gov/arcgis/rest/services/gap/GAP_Land_Cover_NVC_Formation_Landuse/MapServer
    http://gis1.usgs.gov/arcgis/rest/services/gap/GAP_Land_Cover_NVC_Division_Landuse/MapServer
    http://gis1.usgs.gov/arcgis/rest/services/gap/GAP_Land_Cover_NVC_Macrogroup_Landuse/MapServer
    http://gis1.usgs.gov/arcgis/rest/services/gap/GAP_Land_Cover_Ecological_Systems_Landuse/MapServer
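    The MapServer endpoints above follow the standard ArcGIS REST pattern, so a rendered map image can be exported with a single HTTP request. Note that these gis1.usgs.gov hosts date from the original publication and may no longer resolve; the bounding box below is an arbitrary example.

    ```python
    import requests

    # One of the ArcGIS REST MapServer endpoints listed above.
    SERVICE = ("http://gis1.usgs.gov/arcgis/rest/services/gap/"
               "GAP_Land_Cover_NVC_Class_Landuse/MapServer")

    # Standard ArcGIS REST "export" operation; bbox in lon/lat (EPSG:4326).
    params = {
        "bbox": "-85,33,-80,37",   # arbitrary window over the Southeast U.S.
        "bboxSR": 4326,
        "size": "800,600",
        "format": "png",
        "f": "image",
    }
    resp = requests.get(SERVICE + "/export", params=params, timeout=60)
    with open("gap_landcover.png", "wb") as fh:
        fh.write(resp.content)
    ```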

  17. Utility Energy Registry Monthly County Energy Use: Beginning 2021

    • catalog.data.gov
    • gimi9.com
    • +1more
    Updated Sep 6, 2024
    Cite
    data.ny.gov (2024). Utility Energy Registry Monthly County Energy Use: Beginning 2021 [Dataset]. https://catalog.data.gov/dataset/utility-energy-registry-monthly-county-energy-use-beginning-2021
    Explore at:
    Dataset updated
    Sep 6, 2024
    Dataset provided by
    data.ny.gov
    Description

    The Utility Energy Registry (UER) is a database platform that provides streamlined public access to aggregated community-scale energy data. The UER is intended to promote and facilitate community-based energy planning and energy use awareness and engagement. On April 19, 2018, the New York State Public Service Commission (PSC) issued the Order Adopting the Utility Energy Registry under regulatory CASE 17-M-0315. The order requires utilities and CCA administrators under its regulation to develop and report community energy use data to the UER. This dataset includes electricity and natural gas usage data reported by utilities at the county level. Other UER datasets include energy use data reported at the city, town, village, and ZIP code levels. Data in the UER can be used for several important purposes, such as planning community energy programs, developing community greenhouse gas emissions inventories, and understanding how certain energy projects and policies may affect a particular community. It is important to note that the data are subject to privacy screening, and fields that fail the privacy screen are withheld. The New York State Energy Research and Development Authority (NYSERDA) offers objective information and analysis, innovative programs, technical expertise, and support to help New Yorkers increase energy efficiency, save money, use renewable energy, and reduce reliance on fossil fuels. To learn more about NYSERDA's programs, visit nyserda.ny.gov or follow us on X, Facebook, YouTube, or Instagram.
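    data.ny.gov is a Socrata portal, so once the dataset's 4x4 ID is read off its landing page the standard SODA pattern applies. The ID below is a placeholder, not taken from the source, and field names must be inspected before filtering.

    ```python
    import requests

    # "xxxx-xxxx" is a placeholder: substitute the real 4x4 dataset ID shown
    # on the data.ny.gov landing page for the UER county dataset.
    URL = "https://data.ny.gov/resource/xxxx-xxxx.json"

    rows = requests.get(URL, params={"$limit": 50}, timeout=30).json()
    for record in rows[:5]:
        print(record)   # inspect field names before filtering by county or fuel
    ```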

  18. NFL Data (Historic Data Available) - Sports Data, National Football League Datasets. Free Trial Available

    • datarade.ai
    Updated Sep 26, 2024
    Cite
    APISCRAPY (2024). NFL Data (Historic Data Available) - Sports Data, National Football League Datasets. Free Trial Available [Dataset]. https://datarade.ai/data-products/nfl-data-historic-data-available-sports-data-national-fo-apiscrapy
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Sep 26, 2024
    Dataset authored and provided by
    APISCRAPY
    Area covered
    Ireland, Poland, Iceland, Portugal, Bosnia and Herzegovina, China, Norway, Malta, Lithuania, Italy
    Description

    Our NFL Data product offers extensive access to historic and current National Football League statistics and results, available in multiple formats. Whether you're a sports analyst, data scientist, fantasy football enthusiast, or a developer building sports-related apps, this dataset provides everything you need to dive deep into NFL performance insights.

    Key Benefits:

    Comprehensive Coverage: Includes historic and real-time data on NFL stats, game results, team performance, player metrics, and more.

    Multiple Formats: Datasets are available in various formats (CSV, JSON, XML) for easy integration into your tools and applications.

    User-Friendly Access: Whether you are an advanced analyst or a beginner, you can easily access and manipulate data to suit your needs.

    Free Trial: Explore the full range of data with our free trial before committing, ensuring the product meets your expectations.

    Customizable: Filter and download only the data you need, tailored to specific seasons, teams, or players.

    API Access: Developers can integrate real-time NFL data into their apps with API support, allowing seamless updates and user engagement.

    Use Cases:

    Fantasy Football Players: Use the data to analyze player performance, helping to draft winning teams and make better game-day decisions.

    Sports Analysts: Dive deep into historical and current NFL stats for research, articles, and game predictions.

    Developers: Build custom sports apps and dashboards by integrating NFL data directly through API access.

    Betting & Prediction Models: Use data to create accurate predictions for NFL games, helping sportsbooks and bettors alike.

    Media Outlets: Enhance game previews, post-game analysis, and highlight reels with accurate, detailed NFL stats.

    Our NFL Data product ensures you have the most reliable, up-to-date information to drive your projects, whether it's enhancing user experiences, creating predictive models, or simply enjoying in-depth football analysis.

  19. Data from: Active Sonar Data Set

    • data.mendeley.com
    • search.datacite.org
    Updated Oct 9, 2017
    Cite
    Mohammad Khishe (2017). Active Sonar Data Set [Dataset]. http://doi.org/10.17632/fyxjjwzphf.1
    Explore at:
    Dataset updated
    Oct 9, 2017
    Authors
    Mohammad Khishe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this data set, 6 objects, including 2 targets and 4 non-targets, lie on a sandy sea bottom. In this experiment, the transmitted signal is a Wide-Band Linear Frequency Modulated (WLFM) pulse covering the frequency range 5-110 kHz. Targets lying on the bottom are rotated 180 degrees, with 1-degree accuracy, via an electromotor, and backscattered echoes are accumulated up to 10 meters off target. A fine dataset plays a key role in sonar target classification. Given the massive raw data obtained in the previous stage, a heavy computational load is to be expected. To reduce the computational burden of classification and feature extraction, it is essential to detect targets within the total received data; to implement this, the intensity of the received signal is used. Multi-path propagation, secondary reflections, and reverberation due to the shallowness of the region must also be considered. The researcher attempts to eliminate these artifacts after the detection stage and before feature extraction by using a matched filter.
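    The matched-filter step mentioned at the end can be sketched with SciPy: correlate the received signal against the known WLFM transmit pulse and locate the correlation peak. Apart from the 5-110 kHz sweep, every signal parameter below is an assumption.

    ```python
    import numpy as np
    from scipy.signal import chirp, correlate

    fs = 1_000_000   # 1 MHz sampling rate (assumption)
    T = 0.01         # 10 ms pulse length (assumption)
    t = np.arange(0, T, 1 / fs)

    # WLFM transmit pulse sweeping the 5-110 kHz band given in the description.
    tx = chirp(t, f0=5e3, f1=110e3, t1=T, method="linear")

    # Stand-in received signal: one delayed, attenuated echo buried in noise.
    rx = np.zeros(int(0.1 * fs))
    rx[30_000:30_000 + len(tx)] += 0.2 * tx
    rx += 0.1 * np.random.randn(len(rx))

    # Matched filtering = correlation with the known transmit pulse.
    mf = correlate(rx, tx, mode="valid")
    print("echo detected near sample", int(np.argmax(np.abs(mf))))
    ```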

  20. Finals Dataset

    • universe.roboflow.com
    zip
    Updated Feb 23, 2024
    Cite
    rajagiri22 (2024). Finals Dataset [Dataset]. https://universe.roboflow.com/rajagiri22/final-datasets-pyyat
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 23, 2024
    Dataset authored and provided by
    rajagiri22
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Bounding Boxes
    Description

    Final Datasets

    ## Overview

    Final Datasets is a dataset for object detection tasks - it contains annotations for 1,600 images.

    ## Getting Started

    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    