9 datasets found
  1. The Software Heritage Graph Dataset

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Antoine Pietri (2020). The Software Heritage Graph Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_2583977
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Diomidis Spinellis
    Antoine Pietri
    Stefano Zacchiroli
    License

    Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects.

    This is the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild.
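    The full deduplication rests on intrinsic, content-derived identifiers. As a minimal sketch (not the official Software Heritage implementation), file contents can be identified with Git's blob-hashing scheme, which Software Heritage reuses for its content identifiers:

    ```python
    import hashlib

    def swh_content_hash(data: bytes) -> str:
        """Compute a Git-style blob hash (sha1_git), the scheme Software
        Heritage uses as the intrinsic identifier of a file content."""
        header = b"blob %d\x00" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    # Identical contents always map to the same identifier, which is
    # what makes the archive fully deduplicated.
    print(swh_content_hash(b"hello"))
    ```

    Because the identifier depends only on the bytes of the content, the same file appearing in millions of repositories is stored exactly once in the Merkle DAG.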

    The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing.
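    As an illustration of the Athena access path, a query might look like the following sketch; the table name `revision` follows the schema described in the companion MSR 2019 paper and is an assumption here, not a guaranteed name in the live instance:

    ```sql
    -- Hypothetical example: count the archived commits (revisions).
    -- The table name may differ in the deployed Athena database.
    SELECT count(*) AS archived_commits
    FROM revision;
    ```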

    By accessing the dataset, you agree with the Software Heritage Ethical Charter for using the archive data, and the terms of use for bulk access.

    If you use this dataset for research purposes, please cite the following paper:

    Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019. preprint, bibtex

    You can also refer to the above paper for more information about the dataset and for sample queries.

  2. Online File Sharing Platforms Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 16, 2024
    Cite
    Dataintelo (2024). Online File Sharing Platforms Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/online-file-sharing-platforms-market
    Available download formats: pptx, csv, pdf
    Dataset updated
    Oct 16, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Online File Sharing Platforms Market Outlook

    The global Online File Sharing Platforms market size was valued at approximately USD 8.7 billion in 2023 and is projected to reach USD 17.2 billion by 2032, growing at a compound annual growth rate (CAGR) of 8.2% during the forecast period. This robust growth is driven by the increasing need for seamless collaboration in remote work environments, the rise in digital transformation initiatives, and the growing adoption of cloud-based solutions. The market’s expansion is further fueled by the scalability and flexibility offered by these platforms, which are crucial for businesses navigating the complexities of modern digital operations.

    A significant growth factor for the online file sharing platforms market is the shift towards remote and hybrid working models. The COVID-19 pandemic catalyzed a global transition towards remote work, and even post-pandemic, many organizations have adopted hybrid working models as a permanent fixture. This necessitates efficient and secure file sharing solutions. Platforms that enable real-time collaboration, secure file transfers, and integration with other productivity tools are seeing increased demand. Moreover, the ability to share files across various devices seamlessly is critical in maintaining productivity and ensuring business continuity, thus driving market growth.

    Another key driver is the advancing digital transformation efforts across industries. Organizations are increasingly investing in digital tools and infrastructure to streamline operations, enhance data management, and improve overall efficiency. Online file sharing platforms play a vital role in these initiatives by facilitating the easy exchange and storage of large volumes of data. These platforms also offer features like version control, audit trails, and automated workflows, which are essential for maintaining compliance and ensuring efficient data management. Consequently, the demand for advanced file sharing solutions is on the rise, contributing significantly to market growth.

    The proliferation of cloud-based technologies is also a major growth factor for the online file sharing platforms market. Cloud-based solutions offer numerous advantages, including scalability, cost-efficiency, and accessibility from any location with an internet connection. As businesses increasingly migrate their operations to the cloud, the demand for cloud-based file sharing platforms is surging. These platforms provide the necessary infrastructure for secure data storage, sharing, and collaboration, making them indispensable for modern enterprises. Additionally, the continuous advancements in cloud security measures are enhancing the trust and adoption of cloud-based file sharing solutions.

    Regionally, North America holds a significant share of the online file sharing platforms market, driven by the high adoption rate of advanced technologies and the presence of major market players. The region's well-established IT infrastructure and the increasing number of remote workers are also contributing factors. Europe is another prominent market, with growing digital transformation initiatives and stringent data protection regulations driving the demand for secure file sharing solutions. The Asia Pacific region is expected to witness the highest growth rate during the forecast period, supported by rapid technological advancements, increasing internet penetration, and the growing adoption of cloud services in emerging economies.

    Type Analysis

    The type segment of the online file sharing platforms market is bifurcated into Cloud-Based and On-Premises solutions. Cloud-Based file sharing platforms are experiencing significant growth due to their inherent flexibility and scalability. These platforms enable users to access and share files from any location with an internet connection, which is particularly advantageous in the current global shift towards remote and hybrid working models. Additionally, cloud-based solutions eliminate the need for substantial upfront investments in physical infrastructure, making them a cost-effective option for small and medium-sized enterprises (SMEs) and large organizations alike.

    On the other hand, On-Premises file sharing platforms are favored by organizations with stringent data security and compliance requirements. These platforms allow companies to maintain full control over their data, which is crucial for industries dealing with sensitive information.

  3. Frictionless Data Standards Compliance: Stores metadata as datapackage.json files, ensuring interoperability with tools and libraries that support the Frictionless Data specifications.

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). Frictionless Data Standards Compliance: Stores metadata as datapackage.json files, ensuring interoperability with tools and libraries that support the Frictionless Data specifications. [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-gitdatahub
    Dataset updated
    Jun 4, 2025
    Description

    Git LFS Support: Integrates with Git LFS to manage large resource files effectively, preventing repository bloat.

    Extensible Backend Support: Aims to support additional Git services, such as GitLab, in future releases.

    Technical Integration: The extension operates by adding two plugins to CKAN (gitdatahubpackage and gitdatahubresource). These plugins hook into CKAN's workflow to automatically write dataset and resource metadata to the configured Git repository. The extension requires configuration via CKAN's .ini file to enable the plugins and to provide necessary settings, such as the GitHub API access token.

    Benefits & Impact: Utilizing the gitdatahub extension provides version control for CKAN metadata, enabling administrators to track changes to datasets and resources over time. Storing metadata in the Frictionless Data format promotes interoperability and data portability thanks to well-defined open standards, and using Git provides an audit trail and allows others to collaborate and contribute. The extension is helpful when organizations need to keep a copy of their metadata outside of CKAN and want to provide an audit trail for their data.
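    As a sketch of the configuration step described above, the CKAN .ini entries might look like the following; the option names other than `ckan.plugins` are hypothetical, since the source does not spell them out:

    ```ini
    # Enable the two plugins provided by the extension
    # (appended to whatever plugins are already enabled).
    ckan.plugins = ... gitdatahubpackage gitdatahubresource

    # Hypothetical option names: the extension needs a GitHub API token
    # and a target repository, but the exact keys are not given above.
    ckanext.gitdatahub.access_token = <github-api-token>
    ckanext.gitdatahub.repository = my-org/ckan-metadata
    ```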

  4. Supplement 1. Code for conducting the analyses and generating the figures in this paper, including the raw data.

    • datasetcatalog.nlm.nih.gov
    • wiley.figshare.com
    Updated Aug 10, 2016
    Cite
    Supp, S. R.; Graham, Catherine H.; Powers, Donald R.; Goetz, Scott; Wethington, Susan M.; La Sorte, Frank A.; Cormier, Tina A.; Lim, Marisa C. W. (2016). Supplement 1. Code for conducting the analyses and generating the figures in this paper, including the raw data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001584525
    Dataset updated
    Aug 10, 2016
    Authors
    Supp, S. R.; Graham, Catherine H.; Powers, Donald R.; Goetz, Scott; Wethington, Susan M.; La Sorte, Frank A.; Cormier, Tina A.; Lim, Marisa C. W.
    Description

    File List

        hb-migration.r (MD5: 1904c1692a02d984890e4575d0eeb4e6) R script that imports the eBird, map, and equal-area icosahedron data, summarizes the population-level migration patterns, runs the statistical analyses, and outputs figures.

        migration-fxns.r (MD5: a2ae2a47c066a253f18cad5b13cddcf6) R script that holds the relevant functions for executing the hb-migration.r script.

        BBL-Appendix.r (MD5: 370c701d6afb07851907922dcab51de4) R script that imports the Breeding Bird Laboratory data and outputs the figures for the Appendix.

        output-data.zip (MD5: 36e3a92a7d35e84b299d82c8bd746950) Folder containing the partially processed text files (15 .txt files, 3 per species, covering centroids, migration dates, and migration speed) for the main analyses and figures in the paper. These text files can be used in Part II of hb-migration.r and contain output data on the daily population-level centroids, migration dates, and migration speed.

    Part I of hb-migration.r relies on raw eBird data, which was queried from the eBird server directly. The raw eBird data can be requested through their online portal after making a user account (http://help.ebird.org/customer/portal/articles/1010524-can-i-download-raw-data-from-ebird-). The equal-area icosahedron maps are available at http://discreteglobalgrids.org/. The BBL data, used in BBL-Appendix.r, can be requested from the USGS Bird Banding Laboratory (http://www.pwrc.usgs.gov/BBL/homepage/datarequest.cfm).

    Description
        The code and data in this supplement allow for the analyses and figures in the paper to be fully replicated using a data set of manipulated communities collected from the literature.
        Requirements: R 3.x and the following packages: chron, fields, knitr, gamm4, geosphere, ggplot2, ggmap, maps, maptools, mapdata, mgcv, plyr, raster, reshape2, rgdal, Rmisc, SDMTools, sp, spaa, plus the files containing functions specific to this code (listed above).
        The analyses can then be replicated by changing the working directory at the top of the file hb-migration.r to the location on your computer where you have stored the .R and .csv files and running the code. Note that to fully replicate the analyses, the data will need to be requested from the sources listed above. Starting at Part II in hb-migration.r, it should take approximately 30 minutes to run all the code from start to finish. Figures should output as pdfs in your working directory. If you download the raw data and run the analyses starting at Part I, you will need a workstation with large memory to run the analyses in a reasonable amount of time, since the raw eBird data files are very large.
    Version Control Repository: The full version control repository for this project (including post-publication improvements) is publicly available at https://github.com/sarahsupp/hb-migration. If you would like to use the code in this supplement for your own analyses, it is strongly suggested that you use the equivalent code in the repository, as this is the code that is being actively maintained and developed.
    Data use: Partially processed data is provided in this supplement for the purposes of replication. If you wish to use the raw data for additional research, they should be obtained from the original data providers listed above.

  5. Supplement 1. Code for conducting the analyses and generating the figures in this paper, including partially processed data.

    • wiley.figshare.com
    html
    Updated May 30, 2023
    Cite
    Sarah R. Supp; David N. Koons; S. K. Morgan Ernest (2023). Supplement 1. Code for conducting the analyses and generating the figures in this paper, including partially processed data. [Dataset]. http://doi.org/10.6084/m9.figshare.3564183.v1
    Available download formats: html
    Dataset updated
    May 30, 2023
    Dataset provided by
    Wiley
    Authors
    Sarah R. Supp; David N. Koons; S. K. Morgan Ernest
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    File List

        rodent_wrapper.r (MD5: 2c73de19e83585b1f4c37ebb9ee9ab1f) R script that imports the eBird, map, and equal-area icosahedron data, summarizes the population-level migration patterns, runs the statistical analyses, and outputs figures.

        movement_fxns.r (MD5: 4417176e0bfed18b3c2188eb26a5908e) R script that holds the relevant functions for executing the hb-migration.R script.
    
        MARK_analyses.r (MD5: 0a59e029a076e1bec8b4fb529af4c361) R script that imports the Breeding Bird Laboratory data and outputs the figures for the Appendix.
    
      Description
        The code in this supplement allows for the analyses and figures in the paper to be fully replicated using a subset of the published Portal data set which includes individual-level rodent data from 1989–2009. Species evaluated include granivores, folivores, and insectivores: Peromyscus eremicus (PE), Peromyscus maniculatus (PM), Peromyscus leucopus (PL), Onychomys torridus (OT), Onychomys leucogaster (OL), Dipodomys merriami (DM), Dipodomys ordii (DO), Dipodomys spectabilis (DS), Chaetodipus baileyi (PB), Chaetodipus penicillatus (PP), Perognathus flavus (PF), Chaetodipus intermedius (PI), Chaetodipus hispidus (PH), Sigmodon hispidus (SH), Sigmodon fulviventer (SF), Sigmodon ochrognathus (SO), Neotoma albigula (NAO), Baiomys taylori (BA), Reithrodontomys megalotis (RM), Reithrodontomys fulvescens (RF), and Reithrodontomys montanus (RM).
        Requirements: R 2.x, Program MARK (http://www.phidot.org/software/mark), the files containing data and functions specific to this code, and the following packages: ape, calibrate, fields, geiger, ggbiplot, ggmap, ggplot2, gridExtra, picante, PhyloOrchard, plyr, reshape2, and RMark.
        The analyses can then be replicated by changing the working directory at the top of the file rodent_wrapper.R to the location on your computer where you have stored the .R and .csv files and running the code.
        Code for Part I of rodent_wrapper.R should take approximately 30 minutes to run, but depending on the capabilities of the computer used to run the code, it may take many hours to run the code in MARK_analyses.R. Figures should output as pdf, png, or eps files in your working directory. Part II of rodent_wrapper.R continues the analysis using the MARK results. If you download the raw data and run the code from start to finish, you will need a workstation with large memory to run the program in a reasonable amount of time, since the files are large and the analyses require a lot of memory.
       Version Control Repository: The full version control repository for this project (including post-publication improvements) is publicly available at https://github.com/weecology/portal-rodent-dispersal. If you would like to use the code in this supplement for your own analyses, it is strongly suggested that you use the equivalent code in the repository, as this is the code that is being actively maintained and developed.
        Data use: Partially-processed data is provided in the GitHub repository for the purposes of replication. The raw data should be obtained from the original data providers (Ernest et al. 2009) and can be downloaded from Ecological Archives (http://www.esajournals.org/doi/abs/10.1890/08-1222.1).
    
  6. Data from: Training Data for the NeonTreeEvaluation Benchmark

    • explore.openaire.eu
    • zenodo.org
    Updated Jan 1, 2020
    Cite
    Ben Weinstein; Sergio Marconi; Ethan White (2020). Training Data for the NeonTreeEvaluation Benchmark [Dataset]. http://doi.org/10.5281/zenodo.5912107
    Dataset updated
    Jan 1, 2020
    Authors
    Ben Weinstein; Sergio Marconi; Ethan White
    Description

    This dataset contains the large training data files for the NeonTreeEvaluation Benchmark for individual tree detection from airborne imagery. For each geographic site, given by its NEON four-letter code (e.g., HARV = Harvard Forest), there are up to 4 files: an RGB image, a LiDAR tile, a 426-band hyperspectral file, and a 1 m canopy height file. For more information on the benchmark and the corresponding R package, see https://github.com/weecology/NeonTreeEvaluation_package. Annotations for the tiles, made by inspecting the RGB imagery, are under version control here: https://github.com/weecology/NeonTreeEvaluation/tree/master/annotations. Download training.zip to get all files in the same folder organization as the evaluation data.

  7. Software Heritage Graph Dataset

    • registry.opendata.aws
    Updated Mar 12, 2019
    + more versions
    Cite
    Software Heritage (2019). Software Heritage Graph Dataset [Dataset]. https://registry.opendata.aws/software-heritage/
    Dataset updated
    Mar 12, 2019
    Dataset provided by
    Software Heritage, https://softwareheritage.org/
    License

    Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. Author and committer information is anonymized.

  8. BiNA: A Visual Analytics Tool for Biological Network Data

    • plos.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    Andreas Gerasch; Daniel Faber; Jan Küntzer; Peter Niermann; Oliver Kohlbacher; Hans-Peter Lenhof; Michael Kaufmann (2023). BiNA: A Visual Analytics Tool for Biological Network Data [Dataset]. http://doi.org/10.1371/journal.pone.0087397
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Andreas Gerasch; Daniel Faber; Jan Küntzer; Peter Niermann; Oliver Kohlbacher; Hans-Peter Lenhof; Michael Kaufmann
    License

    Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Interactive visual analysis of biological high-throughput data in the context of the underlying networks is an essential task in modern biomedicine with applications ranging from metabolic engineering to personalized medicine. The complexity and heterogeneity of data sets require flexible software architectures for data analysis. Concise and easily readable graphical representation of data and interactive navigation of large data sets are essential in this context. We present BiNA - the Biological Network Analyzer - a flexible open-source software for analyzing and visualizing biological networks. Highly configurable visualization styles for regulatory and metabolic network data offer sophisticated drawings and intuitive navigation and exploration techniques using hierarchical graph concepts. The generic projection and analysis framework provides powerful functionalities for visual analyses of high-throughput omics data in the context of networks, in particular for the differential analysis and the analysis of time series data. A direct interface to an underlying data warehouse provides fast access to a wide range of semantically integrated biological network databases. A plugin system allows simple customization and integration of new analysis algorithms or visual representations. BiNA is available under the 3-clause BSD license at http://bina.unipax.info/.

  9. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Cite
    Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Available download formats: txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Ltd.
    Authors
    Fabiano Dalpiaz
    License

    Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 22 data sets of 50+ requirements each, expressed as user stories.

    The dataset has been created by gathering data from web sources; we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removing that dataset [see Zenodo's policies].

    The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

    Overview of the datasets [data and links added in December 2024]

    The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

    Public administration and transparency

    g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS, or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal-spending-related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of requirements extracted from a document, by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here; it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

    g04-recycling.txt (2017) concerns a web application where recycling and waste-disposal facilities can be searched for and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub repository and is at the basis of a students' project on website design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.

    g11-nsf.txt (2018) is a collection of user stories about the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers open standards and tooling for building data infrastructure, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT licenses) and on the web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

    g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis, and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

    g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.

    g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration, such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its

