100+ datasets found

NIDDK Central Repository - fj8i-77zk - Archive Repository
healthdata.gov
application/rdfxml +5
Updated Aug 18, 2023
Cite
(2023). NIDDK Central Repository - fj8i-77zk - Archive Repository [Dataset]. https://healthdata.gov/dataset/NIDDK-Central-Repository-fj8i-77zk-Archive-Reposit/7phz-ieud
Explore at:
Available download formats: application/rssxml, csv, tsv, json, xml, application/rdfxml
Dataset updated
Aug 18, 2023
Description
This dataset tracks the updates made on the dataset "NIDDK Central Repository" as a repository for previous versions of the data and metadata.
Number of open source projects and versions worldwide 2023, by ecosystem
statista.com
Updated Jul 1, 2025
Cite
Statista (2025). Number of open source projects and versions worldwide 2023, by ecosystem [Dataset]. https://www.statista.com/statistics/1268650/worldwide-open-source-projects-versions-ecosystems/
Explore at:
Dataset updated
Jul 1, 2025
Dataset authored and provided by
Statista, http://statista.com/
Time period covered
2023
Area covered
Worldwide
Description
At the end of 2022, there were approximately *** million JavaScript open source projects in the Maven Central Repository and around ** million JavaScript project versions worldwide. While JavaScript is the largest ecosystem in the Maven Central Repository, Java, Python, and .NET also have thousands of available open source projects.
2026-06-26 - NIDDK Central Repository - CoreTrustSeal Requirements 2020-2022...
dataverse.nl
pdf
Updated Mar 26, 2024
Cite
NIDDK Central Repository; NIDDK Central Repository (2024). 2026-06-26 - NIDDK Central Repository - CoreTrustSeal Requirements 2020-2022 [Dataset]. http://doi.org/10.34894/NOYYSF
Explore at:
Available download formats: pdf (230725)
Unique identifier
https://doi.org/10.34894/NOYYSF
Dataset updated
Mar 26, 2024
Dataset provided by
DataverseNL
Authors
NIDDK Central Repository; NIDDK Central Repository
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
CoreTrustSeal certification
Central Park Follow Up - raes-ukcy - Archive Repository
healthdata.gov
application/rdfxml +5
Updated Jul 26, 2023
Cite
(2023). Central Park Follow Up - raes-ukcy - Archive Repository [Dataset]. https://healthdata.gov/dataset/Central-Park-Follow-Up-raes-ukcy-Archive-Repositor/iftq-java
Explore at:
Available download formats: xml, csv, application/rssxml, json, application/rdfxml, tsv
Dataset updated
Jul 26, 2023
Description
This dataset tracks the updates made on the dataset "Central Park Follow Up" as a repository for previous versions of the data and metadata.
Central Park - 89bt-rfpj - Archive Repository
healthdata.gov
application/rdfxml +5
Updated Jul 25, 2023
Cite
(2023). Central Park - 89bt-rfpj - Archive Repository [Dataset]. https://healthdata.gov/dataset/Central-Park-89bt-rfpj-Archive-Repository/jcv9-meqc
Explore at:
Available download formats: json, csv, tsv, application/rssxml, xml, application/rdfxml
Dataset updated
Jul 25, 2023
Description
This dataset tracks the updates made on the dataset "Central Park" as a repository for previous versions of the data and metadata.
Qualisign: Software Metrics and GoF Design Patterns of the Maven Central...
data.niaid.nih.gov
zenodo.org
Updated Sep 24, 2020
Cite
Aichberger, Johann (2020). Qualisign: Software Metrics and GoF Design Patterns of the Maven Central Repository [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3731871
Explore at:
Dataset updated
Sep 24, 2020
Dataset authored and provided by
Aichberger, Johann
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains software metric and design pattern data for around 100,000 projects from the Maven Central repository. The data was collected and analyzed as part of my master's thesis "Mining Software Repositories for the Effects of Design Patterns on Software Quality" (https://www.overleaf.com/read/vnfhydqxmpvx, https://zenodo.org/record/4048275).

The included qualisign.* files all contain the same data in different formats:

- qualisign.sql: standard SQL format (exported using "pg_dump --inserts ...")
- qualisign.psql: PostgreSQL plain format (exported using "pg_dump -Fp ...")
- qualisign.csql: PostgreSQL custom format (exported using "pg_dump -Fc ...")

create-tables.sql has to be executed before importing one of the qualisign.* files. Once qualisign.*sql has been imported, create-views.sql can be executed to preprocess the data, thereby creating materialized views that are more appropriate for data analysis purposes.
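
The import order described above can be sketched as a small Python helper that only builds the command lines. This is a sketch under assumptions: the database name `qualisign` is a placeholder, and using `psql -f` for the SQL files and `pg_restore` for the custom-format dump is an assumption, not something taken from the dataset's documentation. Whether extra flags (for example a data-only restore) are needed depends on how the dumps were produced.

```python
from typing import List


def import_commands(dump_file: str, dbname: str = "qualisign") -> List[List[str]]:
    """Return the commands to run, in order, for one qualisign.* dump.

    The database name is a placeholder; adjust it to your setup.
    """
    if dump_file.endswith(".csql"):
        # Custom-format dumps (pg_dump -Fc) are restored with pg_restore.
        load = ["pg_restore", "-d", dbname, dump_file]
    else:
        # .sql and .psql are plain SQL scripts, which psql can execute.
        load = ["psql", "-d", dbname, "-f", dump_file]
    return [
        ["psql", "-d", dbname, "-f", "create-tables.sql"],  # 1. schema first
        load,                                               # 2. then the data
        ["psql", "-d", dbname, "-f", "create-views.sql"],   # 3. then the views
    ]


for cmd in import_commands("qualisign.csql"):
    print(" ".join(cmd))
```

Each inner list could be passed to `subprocess.run(cmd, check=True)` once a PostgreSQL server and the files are available.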

Software metrics were calculated using CKJM extended: http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/

Included software metrics are (21 total):

- AMC: Average Method Complexity
- CA: Afferent Coupling
- CAM: Cohesion Among Methods
- CBM: Coupling Between Methods
- CBO: Coupling Between Objects
- CC: Cyclomatic Complexity
- CE: Efferent Coupling
- DAM: Data Access Metric
- DIT: Depth of Inheritance Tree
- IC: Inheritance Coupling
- LCOM: Lack of Cohesion of Methods (Chidamber and Kemerer)
- LCOM3: Lack of Cohesion of Methods (Constantine and Graham)
- LOC: Lines of Code
- MFA: Measure of Functional Abstraction
- MOA: Measure of Aggregation
- NOC: Number of Children
- NOM: Number of Methods
- NOP: Number of Polymorphic Methods
- NPM: Number of Public Methods
- RFC: Response for Class
- WMC: Weighted Methods per Class

In the qualisign.* data, these metrics are only available on the class level. create-views.sql additionally provides averages of these metrics on the package and project levels.
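
To illustrate what averaging class-level metrics up to the package and project levels means, here is a minimal Python sketch. The rows, names, and values are invented for the example, and averaging directly over classes (rather than over package averages) is an assumption; the actual definitions are in create-views.sql.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical class-level rows: (project, package, class, metric values).
rows = [
    ("proj-a", "com.example.core", "Parser", {"wmc": 12, "loc": 340}),
    ("proj-a", "com.example.core", "Lexer", {"wmc": 8, "loc": 200}),
    ("proj-a", "com.example.util", "Strings", {"wmc": 4, "loc": 60}),
]


def average_by(rows, key_index):
    """Average every metric over the rows that share rows[i][key_index]."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[3])
    return {
        key: {m: mean(d[m] for d in dicts) for m in dicts[0]}
        for key, dicts in groups.items()
    }


package_avg = average_by(rows, 1)  # one entry per package
project_avg = average_by(rows, 0)  # one entry per project
print(package_avg["com.example.core"])
print(project_avg["proj-a"])
```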

Design patterns were detected using SSA: https://users.encs.concordia.ca/~nikolaos/pattern_detection.html

Included design patterns are (15 total):

- Adapter
- Bridge
- Chain of Responsibility
- Command
- Composite
- Decorator
- Factory Method
- Observer
- Prototype
- Proxy
- Singleton
- State
- Strategy
- Template Method
- Visitor

The code to generate the dataset is available at: https://github.com/jaichberg/qualisign

The code to perform quality analysis on the dataset is available at: https://github.com/jaichberg/qualisign-analysis
Data from: Data sharing through an NIH central database repository: a...
datadryad.org
data.niaid.nih.gov
+1 more
zip
Updated Sep 2, 2016
Cite
Joseph S. Ross; Jessica D. Ritchie; Emily Finn; Nihar R. Desai; Richard L. Lehman; Harlan M. Krumholz; Cary P. Gross (2016). Data sharing through an NIH central database repository: a cross-sectional survey of BioLINCC users [Dataset]. http://doi.org/10.5061/dryad.j38b7
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.j38b7
Dataset updated
Sep 2, 2016
Dataset provided by
Dryad
Authors
Joseph S. Ross; Jessica D. Ritchie; Emily Finn; Nihar R. Desai; Richard L. Lehman; Harlan M. Krumholz; Cary P. Gross
Time period covered
Aug 31, 2016
Description
Dryad BioLINCC Survey Data 16-09-01: This is the deidentified data from the 2015 cross-sectional survey of investigators who requested and received access to clinical research data from BioLINCC between 2007 and 2014.

READ ME Dryad BioLINCC Survey 16-09-01.txt

Data Dictionary BioLINCC Survey 16-09-01: This file lists and describes the variables from the 2015 cross-sectional BioLINCC survey.
CCMMercury System -.
datadiscoverystudio.org
Updated Mar 1, 2017
+ more versions
Cite
(2017). CCMMercury System -. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/8a769563b5a0462eada9e85e2d19d094/html
Explore at:
Dataset updated
Mar 1, 2017
Description
The CCMMercury System is a correspondence tracking (or control) system which (1) provides a central repository for agency correspondence, (2) tracks and manages correspondence, and (3) tracks and manages correspondence letters.
Views regarding the format of data and governance arrangements for a central...
plos.figshare.com
xls
Updated May 30, 2023
Cite
Catrin Tudur Smith; Kerry Dwan; Douglas G. Altman; Mike Clarke; Richard Riley; Paula R. Williamson (2023). Views regarding the format of data and governance arrangements for a central repository of IPD. [Dataset]. http://doi.org/10.1371/journal.pone.0097886.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0097886.t001
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Catrin Tudur Smith; Kerry Dwan; Douglas G. Altman; Mike Clarke; Richard Riley; Paula R. Williamson
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Responders could provide more than one reason so the numbers do not add to 30. 13 responders recorded two formats. 28 responders recorded two governance issues, 1 responder recorded three governance issues, 2 responders recorded four governance issues, 1 responder recorded five governance issues.
Joint Asset Recovery Database
data.wu.ac.at
data.europa.eu
Updated Dec 12, 2013
Cite
Home Office (2013). Joint Asset Recovery Database [Dataset]. https://data.wu.ac.at/schema/data_gov_uk/ZTRiMzJkMDQtN2Q0MS00M2E1LWFkMzAtZmFlZDUxY2E1MDQ1
Explore at:
Dataset updated
Dec 12, 2013
Dataset provided by
Home Office
Description
A central repository of information relating to seizures of the Proceeds of Crime.
Utilization of open source projects worldwide 2021, by ecosystem
statista.com
Updated Jan 9, 2024
Cite
Statista (2024). Utilization of open source projects worldwide 2021, by ecosystem [Dataset]. https://www.statista.com/statistics/1268859/worldwide-open-source-projects-utilization-share-ecosystems/
Explore at:
Dataset updated
Jan 9, 2024
Dataset authored and provided by
Statista, http://statista.com/
Time period covered
Jul 31, 2021
Area covered
Worldwide
Description
At the end of July 2021, there were roughly 1.9 million JavaScript open source projects in the Maven Central Repository and 21 million JavaScript project versions worldwide. While JavaScript was the largest ecosystem for open source projects at that time, it also had one of the lowest ecosystem project utilization rates, at only 2 percent. Java, by contrast, had the highest ecosystem project utilization, at 15 percent.
OverdoseFreePA Repository TAC, UPITT Pharmacy and PCCD
data.pa.gov
application/rdfxml +5
Updated Jul 12, 2018
Cite
Pennsylvania Overdose Reduction Technical Assistance Center (TAC), University of Pittsburgh School of Pharmacy (2018). OverdoseFreePA Repository TAC, UPITT Pharmacy and PCCD [Dataset]. https://data.pa.gov/w/nyv8-wsd2/33ch-zxdi?cur=Tt3IRx1pmm-&from=root
Explore at:
Available download formats: tsv, xml, application/rdfxml, application/rssxml, csv, json
Dataset updated
Jul 12, 2018
Dataset authored and provided by
Pennsylvania Overdose Reduction Technical Assistance Center (TAC), University of Pittsburgh School of Pharmacy
License
U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
Description
OverdoseFreePA

OverdoseFreePA is made possible by the Pennsylvania Commission on Crime and Delinquency, and is directed and managed by the Pennsylvania Overdose Reduction Technical Assistance Center (TAC), University of Pittsburgh School of Pharmacy. The website is a result of collaboration with county and state partners across the Commonwealth of Pennsylvania.

Our partnerships include:

- Pennsylvania District Attorneys Association
- Pennsylvania Medical Society
- Pennsylvania Pharmacist Association
- Pennsylvania Psychiatric Society
- The Hospital and Healthsystem Association of Pennsylvania
- Pennsylvania Dental Association
- Drug Enforcement Administration 360 Strategy

There are a growing number of Pennsylvania counties involved in ramping up overdose prevention, treatment, and recovery activities to address the opioid overdose epidemic. The counties involved are collaborating to develop resources that can be used by all Pennsylvanians to increase community awareness and knowledge of overdose and overdose prevention strategies as well as to support initiatives aimed at decreasing drug overdoses and deaths within the participating counties. As a centralized resource and technical assistance hub, OverdoseFreePA is a central repository for these efforts to facilitate increased treatment and prevention efforts in these communities.

Pennsylvania Opioid Overdose Reduction Technical Assistance Center (TAC)

Pennsylvania, and the nation at large, is in the midst of an opioid overdose epidemic. The TAC’s vision is to lead Pennsylvania communities to zero overdoses. The TAC hopes to achieve this vision by providing concierge technical assistance in the form of data driven recommendations and customized strategic planning to counties working to eliminate overdoses. The TAC strives to lead the field in identifying and sharing strategies to eliminate overdose through the central repository of OverdoseFreePA.

Based out of the Program Evaluation and Research Unit (PERU) at the University of Pittsburgh’s School of Pharmacy, the TAC assists counties and communities in assessing needs, building capacity to address the needs, developing and implementing data driven plans with high quality outcomes, and sustaining initiatives to eliminate overdoses, both fatal and non-fatal, throughout Pennsylvania.

More information here: http://www.overdosefreepa.pitt.edu/who-we-are/
Central Elementary - 9d7y-yhi8 - Archive Repository
healthdata.gov
application/rdfxml +5
Updated Jul 26, 2023
Cite
(2023). Central Elementary - 9d7y-yhi8 - Archive Repository [Dataset]. https://healthdata.gov/dataset/Central-Elementary-9d7y-yhi8-Archive-Repository/d84z-jwpn
Explore at:
Available download formats: csv, tsv, application/rssxml, xml, application/rdfxml, json
Dataset updated
Jul 26, 2023
Description
This dataset tracks the updates made on the dataset "Central Elementary" as a repository for previous versions of the data and metadata.
Observations of bullseye snakehead (Channa marulius) in Florida | gimi9.com
gimi9.com
+ more versions
Cite
Observations of bullseye snakehead (Channa marulius) in Florida | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_observations-of-bullseye-snakehead-channa-marulius-in-florida/
Explore at:
Area covered
Florida
Description
This dataset contains information on the Bullseye Snakehead fish found only in southeastern Florida. It is a subset of a larger database, the Nonindigenous Aquatic Species Database (NAS). This information resource is an established central repository for spatially referenced biogeographic accounts of introduced aquatic species. The NAS website provides scientific reports, online/real-time queries, spatial data sets, distribution maps, fact sheets, and general information.
Unified ICM/Unified CCE Databases
catalog.data.gov
datasets.ai
+2 more
Updated Mar 8, 2025
+ more versions
Cite
Social Security Administration (2025). Unified ICM/Unified CCE Databases [Dataset]. https://catalog.data.gov/dataset/unified-icm-unified-cce-databases
Explore at:
Dataset updated
Mar 8, 2025
Dataset provided by
Social Security Administration, http://ssa.gov/
Description
Unified ICM/Unified CCE software uses information in the central database to determine how to route N8NN calls, including information about telephone system configuration and routing scripts. The local database also contains tables of real-time information that describe activity at the call centers. Historical information is stored in the central database.
Nonindigenous Aquatic Species Database Asian Tiger Shrimp
data.amerigeoss.org
Updated Jul 15, 2019
+ more versions
Cite
ioos (2019). Nonindigenous Aquatic Species Database Asian Tiger Shrimp [Dataset]. https://data.amerigeoss.org/de/dataset/nonindigenous-aquatic-species-database-asian-tiger-shrimp
Explore at:
Dataset updated
Jul 15, 2019
Dataset provided by
ioos
Description
The Nonindigenous Aquatic Species Database (NAS) information resource is an established central repository for spatially referenced biogeographic accounts of introduced aquatic species. The NAS website provides scientific reports, online/real-time queries, spatial data sets, distribution maps, fact sheets, and general information.
Asset database for the Central West subregion on 29 April 2015
data.gov.au
cloud.csiss.gmu.edu
+2 more
Updated Nov 19, 2019
+ more versions
Cite
Bioregional Assessment Program (2019). Asset database for the Central West subregion on 29 April 2015 [Dataset]. https://data.gov.au/data/dataset/5c3f9a56-7a48-4c26-a617-a186c2de5bf7
Explore at:
Dataset updated
Nov 19, 2019
Dataset authored and provided by
Bioregional Assessment Program
Description
Abstract

The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

This database is an initial Asset database for the Central West subregion on 29 April 2015. This dataset contains the spatial and non-spatial (attribute) components of the Central West subregion Asset List as a single .mdb file, which is readable as an MS Access database and a personal geodatabase. Under the BA program, a spatial assets database is developed for each defined bioregional assessment project. The spatial elements that underpin the identification of water dependent assets are identified in the first instance by regional NRM organisations (via the WAIT tool) and supplemented with additional elements from national and state/territory government datasets. All reports received associated with the WAIT process for Central West are included in the zip file as part of this dataset.

Elements are initially included in the preliminary assets database if they are partly or wholly within the subregion's preliminary assessment extent (Materiality Test 1, M1). Elements are then grouped into assets which are evaluated by project teams to determine whether they meet the second Materiality Test (M2). Assets meeting both Materiality Tests comprise the water dependent asset list. Descriptions of the assets identified in the Central West subregion are found in the "AssetList" table of the database. In this version of the database only M1 has been assessed.

Assets are the spatial features used by project teams to model scenarios under the BA program. Detailed attribution does not exist at the asset level. Asset attribution includes only the core set of BA-derived attributes reflecting the BA classification hierarchy, as described in Appendix A of "CEN_asset_database_doc_20150429.doc", located in the zip file as part of this dataset. The "Element_to_Asset" table contains the relationships and identifies the elements that were grouped to create each asset.

Detailed information describing the database structure and content can be found in the document "CEN_asset_database_doc_20150429.doc" located in the zip file. Some of the source data used in the compilation of this dataset is restricted.

Dataset History

This is the initial asset database.

The Bioregional Assessments methodology (Barrett et al., 2013) defines a water-dependent asset as a spatially distinct, geo-referenced entity contained within a bioregion with characteristics having a defined cultural indigenous, economic or environmental value, and that can be linked directly or indirectly to a dependency on water quantity and/or quality.

Under the BA program, a spatial assets database is developed for each defined bioregional assessment project. The spatial elements that underpin the identification of water dependent assets are identified in the first instance by regional NRM organisations (via the WAIT tool) and supplemented with additional elements from national and state/territory government datasets. Elements are initially included in database if they are partly or wholly within the subregion's preliminary assessment extent (Materiality Test 1, M1). Elements are then grouped into assets which are evaluated by project teams to determine whether they meet materiality test 2 (M2) - assets considered to be water dependent.
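
The "partly or wholly within" criterion of Materiality Test 1 can be illustrated with a simple overlap check. This is only a sketch: it uses axis-aligned bounding boxes as a stand-in for real geometries, and the `Box` type, `passes_m1` function, and coordinates are hypothetical, not part of the BA tooling.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box; a stand-in for real geometries."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def passes_m1(element: Box, extent: Box) -> bool:
    """True if the element lies partly or wholly within the extent."""
    return (
        element.min_x <= extent.max_x
        and element.max_x >= extent.min_x
        and element.min_y <= extent.max_y
        and element.max_y >= extent.min_y
    )


extent = Box(0, 0, 10, 10)  # hypothetical preliminary assessment extent
elements = {
    "inside": Box(2, 2, 3, 3),
    "straddling": Box(8, 8, 15, 15),
    "outside": Box(20, 20, 25, 25),
}
included = [name for name, box in elements.items() if passes_m1(box, extent)]
print(included)
```

An element that straddles the boundary is kept whole, which matches the statement that features are not clipped to the assessment extent.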

Elements may be represented by a single, discrete spatial unit (polygon, line or point), or a number of spatial units occurring at more than one location (multipart polygons/lines or multipoints). Spatial features representing elements are not clipped to the preliminary assessment extent - features that extend beyond the boundary of the assessment extent have been included in full. To assist with an assessment of the relative importance of elements, area statements have been included as an attribute of the spatial data. Detailed attribute tables contain descriptions of the geographic features at the element level. Tables are organised by data source and can be joined to the spatial data on the "ElementID" field.

Elements are grouped into Assets, which are the objects used by project teams to model scenarios under the BA program. Detailed attribution does not exist at the asset level. Asset attribution includes only the core set of BA-derived attributes reflecting the BA classification hierarchy.

The "Element_to_asset" table contains the relationships and identifies the elements that were grouped to create each asset.

Following delivery of the first pass asset list, project teams make a determination as to whether an asset (comprised of one or more elements) is water dependent, as assessed against the materiality tests detailed in the BA Methodology. These decisions are provided to ERIN by the project team leader and incorporated into the Assetlist table in the Asset database. The Asset database is then re-registered into the BA repository.

The Asset database dataset (which is registered to the BA repository) contains separate spatial and non-spatial databases.

Non-spatial (tabular) data is provided in an ESRI personal geodatabase (.mdb - doubling as a MS Access database) to store, query, and manage non-spatial data. This database can be accessed using either MS Access or ESRI GIS products. Non-spatial data has been provided in the Access database to simplify the querying process for BA project teams. Source datasets are highly variable and have different attributes, so separate tables are maintained in the Access database to enable the querying of thematic source layers.

Spatial data is provided as an ESRI file geodatabase (.gdb), and can only be used in an ESRI GIS environment. Spatial data is represented as a series of spatial feature classes (point, line and polygon layers). Non-spatial attribution can be joined from the Access database using the AID and ElementID fields, which are common to both the spatial and non-spatial datasets. Spatial layers containing all the point, line and polygon - derived elements and assets have been created to simplify management of the Elementlist and Assetlist tables, which list all the elements and assets, regardless of the spatial data geometry type. i.e. the total number of features in the combined spatial layers (points, lines, polygons) for assets (and elements) is equal to the total number of non-spatial records of all the individual data sources.
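
The join on ElementID and the record-count relationship described above can be sketched in plain Python. The records below are invented for illustration; only the "ElementID" field name comes from the text, and real work would be done in MS Access or an ESRI GIS environment.

```python
# Hypothetical records; only the "ElementID" field name comes from the text.
spatial_features = [
    {"ElementID": 1, "geometry_type": "point"},
    {"ElementID": 2, "geometry_type": "line"},
    {"ElementID": 3, "geometry_type": "polygon"},
]
element_list = [
    {"ElementID": 1, "source": "source-a"},
    {"ElementID": 2, "source": "source-b"},
    {"ElementID": 3, "source": "source-a"},
]

# Index the non-spatial table by its key, then attach attributes to features.
attributes = {row["ElementID"]: row for row in element_list}
joined = [
    {**feature, **attributes[feature["ElementID"]]}
    for feature in spatial_features
]

# The combined point, line and polygon features should equal the number of
# non-spatial records, as stated in the description.
counts_match = len(spatial_features) == len(element_list)
print(joined[0], counts_match)
```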

Dataset Citation

Department of the Environment (2013) Asset database for the Central West subregion on 29 April 2015. Bioregional Assessment Derived Dataset. Viewed 08 February 2017, http://data.bioregionalassessments.gov.au/dataset/5c3f9a56-7a48-4c26-a617-a186c2de5bf7.

Dataset Ancestors

Derived From Macquarie Marshes Vegetation 1991-2008 VIS_ID 3920

Derived From NSW Office of Water GW licence extract linked to spatial locations NIC v2 (28 February 2014)

Derived From NSW Office of Water Surface Water Entitlements Locations v1_Oct2013

Derived From Travelling Stock Route Conservation Values

Derived From NSW Wetlands

Derived From Communities of National Environmental Significance Database - RESTRICTED - Metadata only

Derived From National Groundwater Dependent Ecosystems (GDE) Atlas

Derived From Birds Australia - Important Bird Areas (IBA) 2009

Derived From Environmental Asset Database - Commonwealth Environmental Water Office

Derived From NSW Office of Water Surface Water Offtakes - NIC v1 20131024

Derived From National Groundwater Dependent Ecosystems (GDE) Atlas (including WA)

Derived From Species Profile and Threats Database (SPRAT) - Australia - Species of National Environmental Significance Database (BA subset - RESTRICTED - Metadata only)

Derived From Ramsar Wetlands of Australia

Derived From Native Vegetation Management (NVM) - Manage Benefits

Derived From Key Environmental Assets - KEA - of the Murray Darling Basin

Derived From National Heritage List Spatial Database (NHL) (v2.1)

Derived From Climate Change Corridors (Dry Habitat) for North East NSW

Derived From Great Artesian Basin and Laura Basin groundwater recharge areas

Derived From NSW Office of Water combined geodatabase of regulated rivers and water sharing plan regions

Derived From [New South Wales NSW Regional CMA Water Asset
RecFIN Database
fisheries.noaa.gov
Updated Apr 2, 2019
Cite
Pacific States Marine Fisheries Commission (2019). RecFIN Database [Dataset]. https://www.fisheries.noaa.gov/inport/item/55990
Explore at:
Dataset updated
Apr 2, 2019
Dataset provided by
Pacific States Marine Fisheries Commission
Description
The Recreational Fisheries Information Network (RecFIN) database is a centralized repository for marine recreational fisheries data from California, Oregon, and Washington data collection programs.
Replication package for: Altered Histories in Version Control System...
zenodo.org
bin, zip
Updated Jun 2, 2025
Cite
Anonymous; Anonymous (2025). Replication package for: Altered Histories in Version Control System Repositories: Evidence from the Trenches [Dataset]. http://doi.org/10.5281/zenodo.15558282
Explore at:
Available download formats: bin, zip
Unique identifier
https://doi.org/10.5281/zenodo.15558282
Dataset updated
Jun 2, 2025
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Anonymous; Anonymous
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

# History Alterations - Replication Package

This repository contains the complete replication package for the research article Altered Histories in Version Control System Repositories: Evidence from the Trenches. The package provides tools to detect, analyze, and categorize Git history alterations across software repositories, along with Jupyter notebooks to reproduce the analysis presented in the paper.

## 📋 Table of Contents

- Overview

- Repository Structure

- Quick Start

- Reproducing the Analysis

- Data

- Tools Description

- Requirements

- Citation

## 🔍 Overview

This replication package enables researchers to reproduce the analysis of altered Git histories in software repositories archived by Software Heritage. The study investigates how and why Git histories are modified over time, providing insights into developer practices and repository maintenance patterns.

Main Research Questions:

- How prevalent are Git history alterations in open-source repositories?

- What types of changes are most commonly made to Git histories?

- What are the root causes of these alterations?

- How do these practices vary across different types of repositories?

## 📁 Repository Structure

    ├── README.md                          # This file
    ├── data/                              # Pre-computed datasets
    │   ├── ...
    ├── altered-history/                   # Main analysis tool
    │   ├── src/                           # Rust source code
    │   ├── notebooks/                     # Analysis notebooks
    │   │   ├── analysis.ipynb             # Main analysis notebook
    │   │   ├── build_analysis_dataset.ipynb
    │   │   └── utils_analysis.py          # Analysis utilities
    │   └── README.md
    ├── git-historian/                     # History checking tool
    │   ├── src/                           # Rust source code
    │   └── README.md
    ├── modified-files/                    # File modification analysis tool
    │   ├── src/                           # Rust source code
    │   ├── notebooks/                     # Additional analysis notebooks
    │   │   ├── license_analysis.ipynb
    │   │   ├── license_categorization.py
    │   │   ├── secret-analysis.ipynb
    │   │   └── swh_license_files.py
    │   └── README.md

## 🚀 Quick Start

### Prerequisites

- Rust (latest stable version)

- Python 3.8+ with Jupyter

- PostgreSQL (for database operations)

- Git (for repository analysis)

### Installation

1. Clone the repository:

       git clone <repository-url>
       cd altered-histories-tool-replication-pkg

2. Unzip all directories

3. Install Python dependencies:

       pip install pandas matplotlib seaborn jupyter plotly numpy

4. Build the Rust tools (optional, for dataset generation):

       cd altered-history && cargo build --release && cd ..
       cd git-historian && cargo build --release && cd ..
       cd modified-files && cargo build --release && cd ..

## 📊 Reproducing the Analysis

### Option 1: Using Pre-computed Data (Recommended)

The data/ directory contains pre-computed datasets that allow you to reproduce all analyses without running the computationally intensive data collection process.

1. Open the main analysis notebook:

       cd altered-history/notebooks
       jupyter notebook analysis.ipynb

2. Run all cells to reproduce the complete analysis.

3. Explore additional analyses:

Modify notebooks at will to explore the dataframe.

    # Build analysis dataset (shows data preparation)
    jupyter notebook build_analysis_dataset.ipynb

    # License-related analysis
    cd ../../modified-files/notebooks
    jupyter notebook license_analysis.ipynb

    # Security and secrets analysis
    jupyter notebook secret-analysis.ipynb

### Option 2: Regenerating the Dataset

To reproduce the complete data collection and analysis pipeline:

1. Download Software Heritage datasets (see individual tool READMEs)

2. Configure database connections in each tool

3. Run the analysis pipeline following the step-by-step instructions in each tool's README

4. Process results using the provided notebooks

Note: Complete dataset regeneration requires significant computational resources and time (potentially weeks for large datasets).

## 📋 Data

The data/ directory contains several key datasets including:

- res.pkl: Main analysis results containing categorized alterations

- stars_without_dup.pkl: Repository popularity metrics (GitHub stars)

- visit_type.pkl: Classification of repository visit patterns

- altered_histories_2024_08_23.dump: PostgreSQL database dump for git-historian tool
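A quick way to confirm that the archives were unzipped correctly is to check for the files listed above. This is a rough sketch under stated assumptions: the file names come from the list, while the helper `missing_data_files` and the relative path `data` are hypothetical and may need adjusting to your checkout.

```python
from pathlib import Path

# File names taken from the list above; the directory location is an assumption.
EXPECTED_FILES = [
    "res.pkl",
    "stars_without_dup.pkl",
    "visit_type.pkl",
    "altered_histories_2024_08_23.dump",
]


def missing_data_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    print("Missing data files:", missing_data_files("data") or "none")
```

If the `.pkl` files were written by pandas, which the dependency list suggests but this README does not state, they can likely be loaded with `pandas.read_pickle`.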

## 🛠️ Tools Description

### 1. altered-history

Purpose: Detects and categorizes Git history alterations in Software Heritage archives.

Key Features:

- Three-step analysis pipeline (detection → root cause → categorization)

- Parallel processing for large datasets

- Comprehensive alteration taxonomy

Usage: See altered-history/README.md for detailed instructions.

### 2. git-historian

Purpose: Checks individual repositories against the database of known alterations.

Key Features:

- PostgreSQL integration

- Git hook integration for automated checking

- Caching system for performance

Usage: See git-historian/README.md for detailed instructions.

### 3. modified-files

Purpose: Analyzes file-level modifications and their patterns.

Key Features:

- File modification tracking

- License and security analysis

- Integration with Software Heritage graph

Usage: See modified-files/README.md for detailed instructions.

## 📋 Requirements

### System Requirements

- Memory: Minimum 16GB RAM (1.5TB+ recommended for full dataset processing)

- Storage: 600GB+ free space for complete datasets

- CPU: Multi-core processor recommended for parallel processing
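The storage and CPU requirements can be checked with the Python standard library before starting a full run. The sketch below is illustrative only: `check_resources` is a hypothetical helper, the 600 GB threshold is read as decimal gigabytes, and memory is not checked because the standard library offers no portable way to query it.

```python
import os
import shutil


def check_resources(path=".", min_free_gb=600):
    """Report free disk space at path and the number of logical CPUs."""
    free_gb = shutil.disk_usage(path).free / 1e9  # decimal gigabytes (assumption)
    return {
        "free_gb": free_gb,
        "storage_ok": free_gb >= min_free_gb,
        "cpus": os.cpu_count() or 1,  # cpu_count() may return None
    }


if __name__ == "__main__":
    print(check_resources())
```

Pass the directory where the datasets will be stored as `path`, since free space is reported for the filesystem containing that path.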

## 🔄 Reproducibility Notes

1. Deterministic Results: The analysis notebooks will produce identical results when run with the provided datasets.

2. Versioning: All tools are pinned to specific versions to ensure reproducibility.

3. Random Seeds: Where applicable, random seeds are fixed in the analysis code.
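To illustrate what fixing a random seed means in practice, here is a minimal sketch using only the standard library. It is not taken from the notebooks: the seed value and the `sample_rows` helper are hypothetical, and the notebooks may seed NumPy or pandas instead.

```python
import random

SEED = 42  # hypothetical value; the notebooks' actual seeds are not stated here


def sample_rows(n_rows, k, seed=SEED):
    """Pick k distinct row indices from range(n_rows) with a private, seeded generator."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(range(n_rows), k))


if __name__ == "__main__":
    # Two calls with the same seed select the same rows.
    print(sample_rows(1000, 5) == sample_rows(1000, 5))
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the sample repeatable even if other code draws random numbers in between.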
Archive of Geosample Data and Information from the University of Rhode...
data.amerigeoss.org
datadiscoverystudio.org
html, jsp, ods
Updated Jul 28, 2019
United States[old] (2019). Archive of Geosample Data and Information from the University of Rhode Island (URI) Graduate School of Oceanography (GSO), Marine Geological Samples Laboratory (MGSL) [Dataset]. https://data.amerigeoss.org/fi/dataset/a69f5588-c4c1-48e8-b213-f8fd0ad6b4a0
Explore at:
Available download formats: html, jsp, ods
Dataset updated
Jul 28, 2019
Dataset provided by
United States[old]
License
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Area covered
Rhode Island
Description
The Marine Geological Samples Laboratory (MGSL) of the Graduate School of Oceanography (GSO), University of Rhode Island is a partner in the Index to Marine and Lacustrine Geological Samples (IMLGS) database, contributing information to the IMLGS to help researchers discover geological samples curated in their facility. The partner repository also sends some related data, documents, and imagery to NCEI for long-term archive, but the originating institution is the definitive source of information related to their sample collection. The MGSL serves as the central repository for dredge rocks, deep-sea cores, grabs and land-based geological samples collected by the Marine Geology and Geophysics group at GSO/URI. The facility is located on the Narragansett Bay Campus of the University of Rhode Island in Narragansett, R.I. A large part of the funding for curatorial activities in the MGSL is obtained from the Ocean Science Division of the National Science Foundation. The MGSL maintains a large collection of marine geological samples