9 datasets found
  1. The Software Heritage Graph Dataset

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Antoine Pietri (2020). The Software Heritage Graph Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_2583977
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Diomidis Spinellis
    Antoine Pietri
    Stefano Zacchiroli
    License

    Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects.

    This is the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild.
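    The full deduplication rests on intrinsic, content-derived identifiers. As a minimal sketch (not the official Software Heritage implementation), file contents can be identified with Git's blob-hashing scheme, which Software Heritage reuses for its content identifiers:

    ```python
    import hashlib

    def swh_content_hash(data: bytes) -> str:
        """Compute a Git-style blob hash (sha1_git), the scheme Software
        Heritage uses as the intrinsic identifier of a file content."""
        header = b"blob %d\x00" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    # Identical contents always map to the same identifier, which is
    # what makes the archive fully deduplicated.
    print(swh_content_hash(b"hello"))
    ```

    Because the identifier depends only on the bytes of the content, the same file appearing in millions of repositories is stored exactly once in the Merkle DAG.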

    The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing.
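    As an illustration of the Athena access path, a query might look like the following sketch; the table name `revision` follows the schema described in the companion MSR 2019 paper and is an assumption here, not a guaranteed name in the live instance:

    ```sql
    -- Hypothetical example: count the archived commits (revisions).
    -- The table name may differ in the deployed Athena database.
    SELECT count(*) AS archived_commits
    FROM revision;
    ```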

    By accessing the dataset, you agree with the Software Heritage Ethical Charter for using the archive data, and the terms of use for bulk access.

    If you use this dataset for research purposes, please cite the following paper:

    Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019. preprint, bibtex

    You can also refer to the above paper for more information about the dataset and for sample queries.

  2. Online File Sharing Platforms Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 16, 2024
    Cite
    Dataintelo (2024). Online File Sharing Platforms Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/online-file-sharing-platforms-market
    Available download formats: pptx, csv, pdf
    Dataset updated
    Oct 16, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Online File Sharing Platforms Market Outlook

    The global Online File Sharing Platforms market size was valued at approximately USD 8.7 billion in 2023 and is projected to reach USD 17.2 billion by 2032, growing at a compound annual growth rate (CAGR) of 8.2% during the forecast period. This robust growth is driven by the increasing need for seamless collaboration in remote work environments, the rise in digital transformation initiatives, and the growing adoption of cloud-based solutions. The market’s expansion is further fueled by the scalability and flexibility offered by these platforms, which are crucial for businesses navigating the complexities of modern digital operations.

    A significant growth factor for the online file sharing platforms market is the shift towards remote and hybrid working models. The COVID-19 pandemic catalyzed a global transition towards remote work, and even post-pandemic, many organizations have adopted hybrid working models as a permanent fixture. This necessitates efficient and secure file sharing solutions. Platforms that enable real-time collaboration, secure file transfers, and integration with other productivity tools are seeing increased demand. Moreover, the ability to share files across various devices seamlessly is critical in maintaining productivity and ensuring business continuity, thus driving market growth.

    Another key driver is the advancing digital transformation efforts across industries. Organizations are increasingly investing in digital tools and infrastructure to streamline operations, enhance data management, and improve overall efficiency. Online file sharing platforms play a vital role in these initiatives by facilitating the easy exchange and storage of large volumes of data. These platforms also offer features like version control, audit trails, and automated workflows, which are essential for maintaining compliance and ensuring efficient data management. Consequently, the demand for advanced file sharing solutions is on the rise, contributing significantly to market growth.

    The proliferation of cloud-based technologies is also a major growth factor for the online file sharing platforms market. Cloud-based solutions offer numerous advantages, including scalability, cost-efficiency, and accessibility from any location with an internet connection. As businesses increasingly migrate their operations to the cloud, the demand for cloud-based file sharing platforms is surging. These platforms provide the necessary infrastructure for secure data storage, sharing, and collaboration, making them indispensable for modern enterprises. Additionally, the continuous advancements in cloud security measures are enhancing the trust and adoption of cloud-based file sharing solutions.

    Regionally, North America holds a significant share of the online file sharing platforms market, driven by the high adoption rate of advanced technologies and the presence of major market players. The region's well-established IT infrastructure and the increasing number of remote workers are also contributing factors. Europe is another prominent market, with growing digital transformation initiatives and stringent data protection regulations driving the demand for secure file sharing solutions. The Asia Pacific region is expected to witness the highest growth rate during the forecast period, supported by rapid technological advancements, increasing internet penetration, and the growing adoption of cloud services in emerging economies.

    Type Analysis

    The type segment of the online file sharing platforms market is bifurcated into Cloud-Based and On-Premises solutions. Cloud-Based file sharing platforms are experiencing significant growth due to their inherent flexibility and scalability. These platforms enable users to access and share files from any location with an internet connection, which is particularly advantageous in the current global shift towards remote and hybrid working models. Additionally, cloud-based solutions eliminate the need for substantial upfront investments in physical infrastructure, making them a cost-effective option for small and medium-sized enterprises (SMEs) and large organizations alike.

    On the other hand, On-Premises file sharing platforms are favored by organizations with stringent data security and compliance requirements. These platforms allow companies to maintain full control over their data, which is crucial for industries dealing with sensitive information.

  3. Frictionless Data Standards Compliance: Stores metadata as datapackage.json files, ensuring interoperability with tools and libraries that support the Frictionless Data specifications.

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). Frictionless Data Standards Compliance: Stores metadata as datapackage.json files, ensuring interoperability with tools and libraries that support the Frictionless Data specifications. [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-gitdatahub
    Dataset updated
    Jun 4, 2025
    Description

    Git LFS Support: Integrates with Git LFS to manage large resource files effectively, preventing repository bloat.

    Extensible Backend Support: Aims to support additional Git services, such as GitLab, in future releases.

    Technical Integration: The extension operates by adding two plugins to CKAN (gitdatahubpackage and gitdatahubresource). These plugins hook into CKAN's workflow to automatically write dataset and resource metadata to the configured Git repository. The extension requires configuration via CKAN's .ini file to enable the plugins and to provide necessary settings, such as the GitHub API access token.

    Benefits & Impact: Utilizing the gitdatahub extension provides version control for CKAN metadata, enabling administrators to track changes to datasets and resources over time. Storing metadata in the Frictionless Data format promotes interoperability and data portability thanks to well-defined open standards, and using Git provides an audit trail and allows others to collaborate and contribute. The extension is helpful when organizations need to keep a copy of their metadata outside of CKAN and want to provide an audit trail for their data.
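    As a sketch of the configuration step described above, the CKAN .ini entries might look like the following; the option names other than `ckan.plugins` are hypothetical, since the source does not spell them out:

    ```ini
    # Enable the two plugins provided by the extension
    # (appended to whatever plugins are already enabled).
    ckan.plugins = ... gitdatahubpackage gitdatahubresource

    # Hypothetical option names: the extension needs a GitHub API token
    # and a target repository, but the exact keys are not given above.
    ckanext.gitdatahub.access_token = <github-api-token>
    ckanext.gitdatahub.repository = my-org/ckan-metadata
    ```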

  4. Supplement 1. Code for conducting the analyses and generating the figures in this paper, including the raw data.

    • datasetcatalog.nlm.nih.gov
    • wiley.figshare.com
    Updated Aug 10, 2016
    Cite
    Supp, S. R.; Graham, Catherine H.; Powers, Donald R.; Goetz, Scott; Wethington, Susan M.; La Sorte, Frank A.; Cormier, Tina A.; Lim, Marisa C. W. (2016). Supplement 1. Code for conducting the analyses and generating the figures in this paper, including the raw data. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001584525
    Dataset updated
    Aug 10, 2016
    Authors
    Supp, S. R.; Graham, Catherine H.; Powers, Donald R.; Goetz, Scott; Wethington, Susan M.; La Sorte, Frank A.; Cormier, Tina A.; Lim, Marisa C. W.
    Description

    File List

        hb-migration.r (MD5: 1904c1692a02d984890e4575d0eeb4e6) R script that imports the eBird, map, and equal-area icosahedron data, summarizes the population-level migration patterns, runs the statistical analyses, and outputs figures.

        migration-fxns.r (MD5: a2ae2a47c066a253f18cad5b13cddcf6) R script that holds the relevant functions for executing the hb-migration.r script.

        BBL-Appendix.r (MD5: 370c701d6afb07851907922dcab51de4) R script that imports the Breeding Bird Laboratory data and outputs the figures for the Appendix.

        output-data.zip (MD5: 36e3a92a7d35e84b299d82c8bd746950) Folder containing the partially processed text files (15 .txt files, 3 per species, covering centroids, migration dates, and migration speed) for the main analyses and figures in the paper. These text files can be used in Part II of hb-migration.r and contain output data on the daily population-level centroids, migration dates, and migration speed.

    Part I of hb-migration.r relies on raw eBird data, which was queried from the eBird server directly. The raw eBird data can be requested through their online portal after making a user account (http://help.ebird.org/customer/portal/articles/1010524-can-i-download-raw-data-from-ebird-). The equal-area icosahedron maps are available at http://discreteglobalgrids.org/. The BBL data, used in BBL-Appendix.r, can be requested from the USGS Bird Banding Laboratory (http://www.pwrc.usgs.gov/BBL/homepage/datarequest.cfm).

    Description
        The code and data in this supplement allow for the analyses and figures in the paper to be fully replicated using a data set of manipulated communities collected from the literature.
        Requirements: R 3.x and the following packages: chron, fields, knitr, gamm4, geosphere, ggplot2, ggmap, maps, maptools, mapdata, mgcv, plyr, raster, reshape2, rgdal, Rmisc, SDMTools, sp, spaa, plus the files containing functions specific to this code (listed above).
        The analyses can then be replicated by changing the working directory at the top of the file hb-migration.r to the location on your computer where you have stored the .R and .csv files and running the code. Note that to fully replicate the analyses, the data will need to be requested from the sources listed above. Starting at Part II in hb-migration.r, it should take approximately 30 minutes to run all the code from start to finish. Figures should output as pdfs in your working directory. If you download the raw data and run the analyses starting at Part I, you will need a workstation with large memory to run the analyses in a reasonable amount of time, since the raw eBird data files are very large.
    Version Control Repository: The full version control repository for this project (including post-publication improvements) is publicly available at https://github.com/sarahsupp/hb-migration. If you would like to use the code in this supplement for your own analyses, it is strongly suggested that you use the equivalent code in the repository, as this is the code that is being actively maintained and developed.
    Data use: Partially processed data is provided in this supplement for the purposes of replication. If you wish to use the raw data for additional research, they should be obtained from the original data providers listed above.

  5. Supplement 1. Code for conducting the analyses and generating the figures in this paper, including partially processed data.

    • wiley.figshare.com
    html
    Updated May 30, 2023
    Cite
    Sarah R. Supp; David N. Koons; S. K. Morgan Ernest (2023). Supplement 1. Code for conducting the analyses and generating the figures in this paper, including partially processed data. [Dataset]. http://doi.org/10.6084/m9.figshare.3564183.v1
    Available download formats: html
    Dataset updated
    May 30, 2023
    Dataset provided by
    Wiley
    Authors
    Sarah R. Supp; David N. Koons; S. K. Morgan Ernest
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    File List

        rodent_wrapper.r (MD5: 2c73de19e83585b1f4c37ebb9ee9ab1f) R script that imports the eBird, map, and equal-area icosahedron data, summarizes the population-level migration patterns, runs the statistical analyses, and outputs figures.

        movement_fxns.r (MD5: 4417176e0bfed18b3c2188eb26a5908e) R script that holds the relevant functions for executing the hb-migration.R script.
    
        MARK_analyses.r (MD5: 0a59e029a076e1bec8b4fb529af4c361) R script that imports the Breeding Bird Laboratory data and outputs the figures for the Appendix.
    
      Description
        The code in this supplement allows for the analyses and figures in the paper to be fully replicated using a subset of the published Portal data set which includes individual-level rodent data from 1989–2009. Species evaluated include granivores, folivores, and insectivores: Peromyscus eremicus (PE), Peromyscus maniculatus (PM), Peromyscus leucopus (PL), Onychomys torridus (OT), Onychomys leucogaster (OL), Dipodomys merriami (DM), Dipodomys ordii (DO), Dipodomys spectabilis (DS), Chaetodipus baileyi (PB), Chaetodipus penicillatus (PP), Perognathus flavus (PF), Chaetodipus intermedius (PI), Chaetodipus hispidus (PH), Sigmodon hispidus (SH), Sigmodon fulviventer (SF), Sigmodon ochrognathus (SO), Neotoma albigula (NAO), Baiomys taylori (BA), Reithrodontomys megalotis (RM), Reithrodontomys fulvescens (RF), and Reithrodontomys montanus (RM).
        Requirements: R 2.x, Program MARK (http://www.phidot.org/software/mark), the files containing data and functions specific to this code, and the following packages: ape, calibrate, fields, geiger, ggbiplot, ggmap, ggplot2, gridExtra, picante, PhyloOrchard, plyr, reshape2, and RMark.
        The analyses can then be replicated by changing the working directory at the top of the file rodent_wrapper.R to the location on your computer where you have stored the .R and .csv files and running the code.
        Code for Part I of rodent_wrapper.R should take approximately 30 minutes to run, but depending on the capabilities of the computer used to run the code, it may take many hours to run the code in MARK_analyses.R. Figures should output as pdf, png, or eps files in your working directory. Part II of rodent_wrapper.R continues the analysis using the MARK results. If you download the raw data and run the code from start to finish, you will need a workstation with large memory to run the program in a reasonable amount of time, since the files are large and the analyses require a lot of memory.
       Version Control Repository: The full version control repository for this project (including post-publication improvements) is publicly available at https://github.com/weecology/portal-rodent-dispersal. If you would like to use the code in this supplement for your own analyses, it is strongly suggested that you use the equivalent code in the repository, as this is the code that is being actively maintained and developed.
        Data use: Partially-processed data is provided in the GitHub repository for the purposes of replication. The raw data should be obtained from the original data providers (Ernest et al. 2009) and can be downloaded from Ecological Archives (http://www.esajournals.org/doi/abs/10.1890/08-1222.1).
    
  6. Data from: Training Data for the NeonTreeEvaluation Benchmark

    • explore.openaire.eu
    • zenodo.org
    Updated Jan 1, 2020
    Cite
    Ben Weinstein; Sergio Marconi; Ethan White (2020). Training Data for the NeonTreeEvaluation Benchmark [Dataset]. http://doi.org/10.5281/zenodo.5912107
    Dataset updated
    Jan 1, 2020
    Authors
    Ben Weinstein; Sergio Marconi; Ethan White
    Description

    This dataset contains the large training data files for the NeonTreeEvaluation Benchmark for individual tree detection from airborne imagery. For each geographic site, given by its NEON four-letter code (e.g., HARV = Harvard Forest), there are up to 4 files: an RGB image, a LiDAR tile, a 426-band hyperspectral file, and a 1 m canopy height file. For more information on the benchmark and the corresponding R package, see https://github.com/weecology/NeonTreeEvaluation_package. Annotations for the tiles, made by inspecting the RGB imagery, are under version control here: https://github.com/weecology/NeonTreeEvaluation/tree/master/annotations. Download training.zip to get all files in the same folder organization as the evaluation data.

  7. Software Heritage Graph Dataset

    • registry.opendata.aws
    Updated Mar 12, 2019
    + more versions
    Cite
    Software Heritage (2019). Software Heritage Graph Dataset [Dataset]. https://registry.opendata.aws/software-heritage/
    Dataset updated
    Mar 12, 2019
    Dataset provided by
    Software Heritage, https://softwareheritage.org/
    License

    Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software Heritage is the largest existing public archive of software source code and accompanying development history. The Software Heritage Graph Dataset is a fully deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset’s contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. Author and committer information is anonymized.

  8. BiNA: A Visual Analytics Tool for Biological Network Data

    • plos.figshare.com
    docx
    Updated Jun 2, 2023
    Cite
    Andreas Gerasch; Daniel Faber; Jan Küntzer; Peter Niermann; Oliver Kohlbacher; Hans-Peter Lenhof; Michael Kaufmann (2023). BiNA: A Visual Analytics Tool for Biological Network Data [Dataset]. http://doi.org/10.1371/journal.pone.0087397
    Available download formats: docx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Andreas Gerasch; Daniel Faber; Jan Küntzer; Peter Niermann; Oliver Kohlbacher; Hans-Peter Lenhof; Michael Kaufmann
    License

    Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Interactive visual analysis of biological high-throughput data in the context of the underlying networks is an essential task in modern biomedicine with applications ranging from metabolic engineering to personalized medicine. The complexity and heterogeneity of data sets require flexible software architectures for data analysis. Concise and easily readable graphical representation of data and interactive navigation of large data sets are essential in this context. We present BiNA - the Biological Network Analyzer - a flexible open-source software for analyzing and visualizing biological networks. Highly configurable visualization styles for regulatory and metabolic network data offer sophisticated drawings and intuitive navigation and exploration techniques using hierarchical graph concepts. The generic projection and analysis framework provides powerful functionalities for visual analyses of high-throughput omics data in the context of networks, in particular for the differential analysis and the analysis of time series data. A direct interface to an underlying data warehouse provides fast access to a wide range of semantically integrated biological network databases. A plugin system allows simple customization and integration of new analysis algorithms or visual representations. BiNA is available under the 3-clause BSD license at http://bina.unipax.info/.

  9. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Cite
    Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Available download formats: txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Ltd.
    Authors
    Fabiano Dalpiaz
    License

    Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 22 data sets of 50+ requirements each, expressed as user stories.

    The dataset has been created by gathering data from web sources; we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removing that dataset [see Zenodo's policies].

    The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

    Overview of the datasets [data and links added in December 2024]

    The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

    Public administration and transparency

    g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS, or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal-spending-related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of requirements extracted from a document, by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here; it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

    g04-recycling.txt (2017) concerns a web application where recycling and waste-disposal facilities can be searched for and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub repository and is at the basis of a students' project on website design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.

    g11-nsf.txt (2018) is a collection of user stories about the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers open standards and tooling for building data infrastructure, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT licenses) and on the web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

    g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis, and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

    g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.

    g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration, such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its

