29 datasets found
  1. Dataset of a Study of Computational reproducibility of Jupyter notebooks...

    • zenodo.org
    pdf, zip
    Updated Jul 11, 2024
    + more versions
    Cite
    Sheeba Samuel; Sheeba Samuel; Daniel Mietchen; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Sheeba Samuel; Sheeba Samuel; Daniel Mietchen; Daniel Mietchen
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the dataset for a study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories linked to publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

    Data Collection and Analysis

    We reuse the Jupyter notebook reproducibility code from the study by Pimentel et al. (2019) and adapt code from ReproduceMeGit. We also provide code for collecting publication metadata from PubMed Central using the NCBI Entrez utilities via Biopython.

    Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve the results in XML format, capturing essential details about journals and articles. By systematically scanning each article, including the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine the repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Using the GitHub API, we enrich our data with repository creation dates, update histories, pushes, and programming languages.
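
    For illustration, this search step can be reproduced with a minimal Biopython sketch along the following lines; the query string is the one quoted above, while the contact email, result limit, and output handling are placeholders rather than the project's actual settings.

      from Bio import Entrez

      # NCBI requires a contact address; this value is a placeholder.
      Entrez.email = "your.name@example.org"

      QUERY = "(ipynb OR jupyter OR ipython) AND github"

      # esearch returns the PMC IDs of articles matching the query.
      handle = Entrez.esearch(db="pmc", term=QUERY, retmax=100)
      record = Entrez.read(handle)
      handle.close()
      pmc_ids = record["IdList"]

      # efetch retrieves the corresponding article records in XML,
      # from which journal details and GitHub links can be extracted.
      handle = Entrez.efetch(db="pmc", id=",".join(pmc_ids), retmode="xml")
      articles_xml = handle.read()
      handle.close()
      print(f"Retrieved {len(pmc_ids)} PMC records")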

    All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.
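
    As a rough sketch of this pattern (not the study's actual schema, which spans 24 tables), repository metadata fetched from the GitHub API can be written to a SQLite table roughly as follows; the table layout and function name are illustrative assumptions, and the token is read from the JUP_GITHUB_PASSWORD variable described in the configuration steps below.

      import os
      import sqlite3
      import requests

      # Hypothetical, simplified table; the real database uses 24 tables created by the archaeology scripts.
      conn = sqlite3.connect("db.sqlite")
      conn.execute(
          """CREATE TABLE IF NOT EXISTS repositories (
                 full_name TEXT PRIMARY KEY,
                 created_at TEXT,
                 pushed_at TEXT,
                 language TEXT
             )"""
      )

      def fetch_repo_metadata(full_name, token=os.environ.get("JUP_GITHUB_PASSWORD")):
          """Fetch creation date, last push, and main language for one repository via the GitHub REST API."""
          headers = {"Authorization": f"token {token}"} if token else {}
          resp = requests.get(f"https://api.github.com/repos/{full_name}", headers=headers, timeout=30)
          resp.raise_for_status()
          data = resp.json()
          return (data["full_name"], data["created_at"], data["pushed_at"], data["language"])

      row = fetch_repo_metadata("fusion-jena/computational-reproducibility-pmc")
      conn.execute("INSERT OR REPLACE INTO repositories VALUES (?, ?, ?, ?)", row)
      conn.commit()
      conn.close()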

    Our reproducibility pipeline was started on 27 March 2023.

    Repository Structure

    Our repository is organized into two main folders:

    • archaeology: This directory hosts scripts to download, parse, and extract metadata from PubMed Central publications and associated repositories. The pipeline creates 24 database tables, which store information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.
    • analyses: Here, you will find notebooks used for the in-depth analysis of data related to our study. The db.sqlite file generated by running the archaeology scripts is stored in the analyses folder for further analysis; the path can be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) examines data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) analyzes data associated with publications in PubMed Central, i.e. plots involving data about articles, journals, publication dates or research fields. The resulting figures from these notebooks are stored in the 'outputs' folder.
    • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.

    Accessing Data and Resources:

    • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
    • For the latest results and re-run data, refer to this link.
    • The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.
    • The metadata in XML format extracted from PubMed Central, which contains the information about the articles and journals, can be accessed in the pmc.xml file.

    System Requirements:

    Running the pipeline:

    • Clone the computational-reproducibility-pmc repository using Git:
      git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
    • Navigate to the computational-reproducibility-pmc directory:
      cd computational-reproducibility-pmc/computational-reproducibility-pmc
    • Configure environment variables in the config.py file:
      GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
      GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
    • Other environment variables can also be set in the config.py file.
      BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
      DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
    • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
      source conda-setup.sh
    • Change to the archaeology directory
      cd archaeology
    • Activate conda environment. We used py36 to run the pipeline.
      conda activate py36
    • Execute the main pipeline script (r0_main.py):
      python r0_main.py

    Running the analysis:

    • Navigate to the analyses directory.
      cd analyses
    • Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
      conda activate raw38
    • Install the required packages using the requirements.txt file.
      pip install -r requirements.txt
    • Launch Jupyterlab
      jupyter lab
    • Refer to the Index.ipynb notebook for the execution order and guidance.

    References:

    • Pimentel, J. F., Murta, L., Braganholo, V., & Freire, J. (2019). A large-scale study about quality and reproducibility of Jupyter notebooks. Proceedings of the 16th International Conference on Mining Software Repositories (MSR).
    • Samuel, S., & König-Ries, B. (2020). ReproduceMeGit: A visualization tool for analyzing reproducibility of Jupyter notebooks.

  2. Agreement Workflow Tool

    • catalog.data.gov
    Updated Mar 8, 2025
    + more versions
    Cite
    Social Security Administration (2025). Agreement Workflow Tool [Dataset]. https://catalog.data.gov/dataset/agreement-workflow-tool
    Explore at:
    Dataset updated
    Mar 8, 2025
    Dataset provided by
    Social Security Administration, http://www.ssa.gov/
    Description

    AWT is a component of the Verification and Information Exchange Workload System (VIEWS). It is an online application which replicates the reimbursable agreement documents for review, approval and signature. AWT is available within the agency nationwide with users in various components within Headquarters, Office of Central Operations and Regional Offices across the country.

  3. HZDR Data Management Strategy — Top-Level Architecture

    • rodare.hzdr.de
    pdf
    Updated Feb 23, 2023
    Cite
    Knodel, Oliver; Gruber, Thomas; Kelling, Jeffrey; Lokamani, Mani; Müller, Stefan; Pape, David; Juckeland, Guido (2023). HZDR Data Management Strategy — Top-Level Architecture [Dataset]. http://doi.org/10.14278/rodare.2513
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 23, 2023
    Dataset provided by
    Helmholtz-Zentrum Dresden - Rossendorf
    Authors
    Knodel, Oliver; Gruber, Thomas; Kelling, Jeffrey; Lokamani, Mani; Müller, Stefan; Pape, David; Juckeland, Guido
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This data publication contains an overview of the Top-Level Architecture of the proposed HZDR Data Management Strategy, with additional descriptions of the various systems and services.

    The Helmholtz-Zentrum Dresden-Rossendorf (HZDR) pursues a comprehensive data management strategy that is designed as an architecture of services to describe and manage scientific experiments in a sustainable manner. This strategy is based on the FAIR principles and aims to ensure the findability, accessibility, interoperability and reusability of research data.
    The HZDR's comprehensive data management covers all phases of the data lifecycle: from planning and collection to analysis, storage, publication and archiving. Each phase is supported by specialised services and tools that help scientists to efficiently collect, store and share their data. These services include:

    • Electronic lab notebook: for the digital recording and management of lab experiments and data.
    • Data management plans (RDMO): For planning and organising data management during a research project.
    • (Time Series) Databases: For structured storage and retrieval of research data.
    • File systems: For storing and managing files in a controlled environment.
    • Publication systems (ROBIS, RODARE): For the publication and accessibility of research data and results.
    • Metadata catalogue (SciCat): For describing data in a wide variety of subsystems using searchable metadata
    • Repositories (Helmholtz Codebase): For archiving, version control and provision of software, special data sets and workflows.
    • Proposal Management System (GATE): For the administration of project proposals and approvals.

    The superordinate web service HELIPORT plays a central role here. HELIPORT acts as a gateway and connecting service that links all components of the Data Management Strategy and describes them in a sustainable manner. HELIPORT ensures standardised access to the various services and tools, which considerably simplifies collaboration and the exchange of data.

  4. ckanext-rems - Extensions - CKAN Ecosystem Catalog

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-rems - Extensions - CKAN Ecosystem Catalog [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-rems
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    The REMS (Resource Entitlement Management System) extension for CKAN brings access rights management capabilities to datasets. By integrating with REMS, this extension allows organizations to manage and control access to sensitive or restricted data through application workflows and approval processes. This enables a more secure and governed environment for data sharing within CKAN.

    Key Features:

    • REMS Integration: Integrates CKAN with the REMS system for managing access rights to datasets, providing a centralized control point for permissions.
    • Application Form and Workflow Design: Utilizes REMS' tools for designing application forms and defining workflows for requesting access to datasets.
    • Access Request Management: Enables end-users to apply for access to datasets through the defined REMS application workflows.
    • Workflow Processing: Provides administrators and authorized users with the tools to process access requests, manage approvals, and administer granted access rights within the REMS interface.
    • Shibboleth Configuration Support: Supports Shibboleth configuration for authentication, potentially enabling single sign-on (SSO) capabilities for accessing REMS-protected datasets.

    Technical Integration: The REMS extension integrates with CKAN through configuration settings defined in the .ini file. It also utilizes the Kata extension as a dependency. Shibboleth configuration details are outlined in the config/shibboleth/README.txt file, giving direction on how to set up single sign-on. The extension essentially connects CKAN datasets to the permissioning framework within a separate REMS instance.

    Benefits & Impact: The REMS extension provides enhanced security and control over dataset access within CKAN. This helps organizations to comply with data governance policies and regulations by enabling a structured and auditable process for granting permissions. Using a separate REMS system offloads access right activities.

  5. ckanext-dsactions

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-dsactions [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-dsactions
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    The ckanext-dsactions extension enhances CKAN by adding an "Actions" tab on a dataset's view page, visible to users with editing permissions for that dataset. This tab provides a central location for performing actions related to the dataset, with the default functionality including a dataset cloning feature. The extension is designed to be extensible, enabling administrators to add other custom actions relevant to dataset management.

    Key Features:

    • Dataset Actions Tab: Introduces a dedicated "Actions" tab on dataset pages, providing a user interface for performing specific operations on a dataset.
    • Dataset Cloning: Includes a built-in "clone" feature, allowing authorized users to create a copy of the dataset efficiently. The precise details of what is cloned (metadata only, resources, etc.) are not specified in the readme, but the intent is efficient duplication.
    • Extensible Design: The "Actions" tab is designed to be easily extended, allowing CKAN administrators to add custom actions and functionalities tailored to specific organizational needs. It provides flexibility in adding new scripts associated with dataset management.
    • Database Export: Facilitates exporting the entire CKAN database using a paster command. This feature is valuable for backups, migrations, or analysis purposes.

    Use Cases:

    • Data Governance: Facilitates easier copying/cloning of datasets for staging changes or for testing purposes prior to pushing changes to production.
    • Custom Workflows: Enable custom actions such as triggering QA processes upon dataset updates, or initiating data transformation scripts.

    Technical Integration: The extension integrates into CKAN by adding a new tab to the dataset view page based on user permissions. It also integrates a shell command export triggered with paster. Activation requires adding dsactions to the CKAN .ini configuration file. Further configuration details and instructions for adding custom actions are not explicitly provided in the readme but would likely involve custom plugin development.

    Benefits & Impact: By centralizing dataset actions and providing a mechanism for extending functionality, ckanext-dsactions streamlines dataset management for CKAN users with editing permissions. The clone feature saves time and effort compared to manually recreating datasets, and the extensibility allows organizations to customize CKAN to their specific data management workflows. The database export functionality provides a relatively straightforward method for backing up or migrating the CKAN database using a command-line tool. As a developer-focused extension, the benefits are primarily directed toward ease of development and operational activities around data management.

  6. Data Management Project for Collaborative Groundwater Research

    • hydroshare.org
    zip
    Updated Apr 24, 2025
    Cite
    Abbygael Johnson; Collins Stephenson; Brett Safely; Brooklyn Taylor (2025). Data Management Project for Collaborative Groundwater Research [Dataset]. https://www.hydroshare.org/resource/faa268eaa07547938d0e696247fc81fd
    Explore at:
    Available download formats: zip (2.1 GB)
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    HydroShare
    Authors
    Abbygael Johnson; Collins Stephenson; Brett Safely; Brooklyn Taylor
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project developed a comprehensive data management system designed to support collaborative groundwater research across institutions by establishing a centralized, structured database for hydrologic time series data. Built on the Observations Data Model (ODM), the system stores time series data and metadata in a relational SQLite database. Key project components included database construction, automation of data formatting and importation, development of analytical and visualization tools, and integration with ArcGIS for geospatial representation. The data import workflow standardizes and validates diverse .csv datasets by aligning them with ODM formatting. A Python-based module was created to facilitate data retrieval, analysis, visualization, and export, while an interactive map feature enables users to explore site-specific data availability. Additionally, a custom ArcGIS script was implemented to generate maps that incorporate stream networks, site locations, and watershed boundaries using DEMs from USGS sources. The system was tested using real-world datasets from groundwater wells and surface water gages across Utah, demonstrating its flexibility in handling diverse formats and parameters. The relational structure enabled efficient querying and visualization, and the developed tools promoted accessibility and alignment with FAIR principles.
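
    As an illustration of the kind of import step described above (not the project's actual code), a pandas-based sketch that standardizes one raw CSV and appends it to an ODM-style table in SQLite could look like this; the column names, table name, and file names are assumptions.

      import sqlite3
      import pandas as pd

      # Hypothetical column mapping from a raw CSV to ODM-style fields.
      COLUMN_MAP = {
          "timestamp": "ValueDateTime",
          "water_level_m": "DataValue",
          "station": "SiteCode",
      }

      def import_csv(csv_path, db_path="odm_timeseries.sqlite"):
          """Standardize one raw CSV and append it to an ODM-style results table."""
          df = pd.read_csv(csv_path, parse_dates=["timestamp"])
          df = df.rename(columns=COLUMN_MAP)[list(COLUMN_MAP.values())]
          # Basic validation: drop rows with missing timestamps or values.
          df = df.dropna(subset=["ValueDateTime", "DataValue"])
          with sqlite3.connect(db_path) as conn:
              df.to_sql("TimeSeriesResultValues", conn, if_exists="append", index=False)
          return len(df)

      # Example (hypothetical file name): print(import_csv("groundwater_well_01.csv"))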

  7. Replication Data for: "Analyzing Reward Dynamics and Decentralization in...

    • dataone.org
    Updated Mar 6, 2024
    Cite
    Yan, Tao; Li, Shengnan; Kraner, Benjamin; Zhang, Luyao; Tessone, Claudio J. (2024). Replication Data for: "Analyzing Reward Dynamics and Decentralization in Ethereum 2.0: An Advanced Data Engineering Workflow and Comprehensive Datasets for Proof-of-Stake Incentives" [Dataset]. http://doi.org/10.7910/DVN/OKQRS1
    Explore at:
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Yan, Tao; Li, Shengnan; Kraner, Benjamin; Zhang, Luyao; Tessone, Claudio J.
    Description

    Ethereum 2.0, as the preeminent smart contract blockchain platform, guarantees the precise execution of applications without third-party intervention. At its core, this system leverages the Proof-of-Stake (PoS) consensus mechanism, which utilizes a stochastic process to select validators for block proposal and validation, consequently rewarding them for their contributions. However, the implementation of blockchain technology often diverges from its central tenet of decentralized consensus, presenting significant analytical challenges. Our study collects consensus reward data from the Ethereum Beacon chain and conducts a comprehensive analysis of reward distribution and evolution, categorizing them into attestation, proposer, and sync committee rewards. To evaluate the degree of decentralization in PoS Ethereum, we apply several inequality indices, including the Shannon entropy, the Gini Index, the Nakamoto Coefficient, and the Herfindahl-Hirschman Index (HHI). Our comprehensive dataset is publicly available on Harvard Dataverse, and our analytical methodologies are accessible via GitHub, promoting open-access research. Additionally, we provide insights on utilizing our data for future investigations focused on assessing, augmenting, and refining the decentralization, security, and efficiency of blockchain systems. GitHub: https://github.com/learn-want/ETH2.0-reward
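
    The inequality measures named above have standard closed forms; as an illustration (the authors' own analysis code lives in the linked GitHub repository), they can be computed over a vector of per-validator reward totals as follows.

      import numpy as np

      def decentralization_indices(rewards):
          """Illustrative inequality metrics over per-validator reward totals."""
          r = np.asarray(rewards, dtype=float)
          share = np.sort(r / r.sum())                 # ascending shares of total rewards
          n = len(share)
          s = share[share > 0]
          entropy = float(-np.sum(s * np.log2(s)))     # Shannon entropy (higher = more even)
          gini = float(2 * np.sum(np.arange(1, n + 1) * share) / n - (n + 1) / n)  # Gini index
          cum = np.cumsum(share[::-1])                 # descending cumulative shares
          nakamoto = int(np.argmax(cum > 0.5)) + 1     # smallest set controlling > 50%
          hhi = float(np.sum(share ** 2))              # Herfindahl-Hirschman Index
          return {"entropy": entropy, "gini": gini, "nakamoto": nakamoto, "hhi": hhi}

      print(decentralization_indices([120.0, 80.0, 40.0, 10.0, 5.0]))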

  8. Brown-Algae.dataset

    • zenodo.org
    Updated Apr 14, 2025
    Cite
    Mahmoud Abdllah; Mahmoud Abdllah (2025). Brown-Algae.dataset [Dataset]. http://doi.org/10.5281/zenodo.14364746
    Explore at:
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Mahmoud Abdllah; Mahmoud Abdllah
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset folder contains essential files required for the execution and testing of the iCulture pipeline. This folder includes:

    1. Reference Databases:

    Pfam-A.hmm and associated index files (*.h3f, *.h3i, *.h3m, *.h3p):

    • These files constitute the Pfam database used for HMMER annotation of protein sequences.

    • The database enables the identification of protein domains and families within query sequences.

    2. Input FASTA Files:

    brown-algae_dataset.fa:

    • This FASTA file is generated from the brown algae dataset downloaded from The Phaeoexplorer Project.

    • It contains protein sequences from brown algae, formatted for compatibility with the pipeline.

    • This file is used as an input for clustering and annotation during the pipeline execution.

    3. Sample Input Files:

    • These files are provided to facilitate testing and ensure reproducibility of the pipeline results.

    Purpose:

    This directory serves as a centralized location for storing datasets and databases necessary for running the iCulture pipeline, particularly in reproducibility-focused workflows.
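
    For orientation only, a Pfam-A.hmm database like the one shipped here is typically searched with HMMER's hmmscan; a minimal sketch of such a run, assuming HMMER 3 is installed and this folder is the working directory (the actual iCulture pipeline invocation and output names may differ):

      import subprocess

      # Annotate the brown algae proteins against Pfam-A with hmmscan.
      # The output file name is an assumption, not part of the dataset.
      subprocess.run(
          [
              "hmmscan",
              "--domtblout", "brown-algae_pfam.domtblout",  # per-domain tabular hits
              "--cpu", "4",
              "Pfam-A.hmm",              # Pfam database provided in this folder
              "brown-algae_dataset.fa",  # input protein FASTA provided in this folder
          ],
          check=True,
      )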

    Note: If you wish to use the pipeline with custom datasets, replace the example files in this folder with your own, following the required format.

  9. Project vernieuwing open access monitoring - peer-reviewed artikelen...

    • zenodo.org
    csv, txt
    Updated Apr 21, 2025
    Cite
    Bianca Kramer; Bianca Kramer (2025). Project vernieuwing open access monitoring - peer-reviewed artikelen [Dataset] [Dataset]. http://doi.org/10.5281/zenodo.15164365
    Explore at:
    Available download formats: txt, csv
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Bianca Kramer; Bianca Kramer
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Deze dataset behoort bij het rapport "Project vernieuwing open access monitoring - Rapportage fase 1 - peer-reviewed artikelen" (https://doi.org/10.5281/zenodo.15061685). De uitgangspunten van het project waren het opzetten van een transparante en reproduceerbare workflow voor open access monitoring van peer-reviewed artikelen van de Nederlandse universiteiten, gebruikmakend van open data en met code en data die geheel open gedeeld kunnen worden.

    De dataset omvat record-level informatie over peer-reviewed artikelen van de Nederlandse universiteiten van publicatiejaar 2023, zoals aangeleverd door de instellingen vanuit hun CRIS-systemen. De data zijn aangevuld met bibliografische informatie uit Crossref, DOAJ, de ISSN registry, en Unpaywall.

    In totaal zijn 50.115 unieke DOIs meegenomen in de analyse (dit is inclusief publicaties van de Universiteit van Humanistiek). Hiervan kon van 49.815 publicaties de OA-status vastgesteld worden.

    Behalve informatie over Open Access types bevat de dataset ook informatie over:

    • de mate waarin artikelen een open licentie hebben en zo ja, welke licentie;
    • de timing van het moment van in green open access beschikbaar komen ten opzichte van de publicatiedatum (o.a. door embargo-termijnen);
    • de repositories waar green open access versies beschikbaar zijn.

    Noot: De resultaten van deze centrale monitoring laten verschillen met de bestaande decentrale monitoring zien. Met name het aandeel OA via repositories is lager. Deels kunnen de verschillen worden verklaard uit de set artikelen die is gebruikt, deels uit de manier waarop de OA status is vastgesteld. Een uitgebreide bespreking van de verschillen tussen de bestaande decentrale monitoring en deze centrale monitoring is terug te vinden in paragraaf 4.1. van de projectrapportage.

    -----------------------------------------

    This dataset is associated with the report "Project Vernieuwing Open Access Monitoring - Report Phase 1 - Peer-Reviewed Articles" [in Dutch] (https://doi.org/10.5281/zenodo.15061685). The project's objectives were to establish a transparent and reproducible workflow for centralized open access monitoring of peer-reviewed articles of the Dutch universities, utilizing open data, with code and data that can be fully shared openly.

    The dataset contains record-level information on peer-reviewed articles from Dutch universities for publication year 2023, as provided by the institutions from their CRIS systems. The data has been supplemented with bibliographic information from Crossref, DOAJ, the ISSN registry, and Unpaywall.

    In total, 50,115 unique DOIs were included in the analysis (including publications from the University of Humanistic Studies). The OA status of 49,815 publications was determined.

    In addition to information on Open Access types, the dataset also includes details on:

    • the extent to which articles have an open license, and if so, which license;
    • when articles became available in green open access relative to the publication date (e.g., due to embargo periods);
    • the repositories where green open access versions are available.

    Note: The results of this central monitoring show differences compared to the existing decentralized monitoring. In particular, the share of OA via repositories is lower. Some of the differences can be explained by the set of articles used and the way in which OA status was determined. A detailed discussion of the differences between the existing decentralized monitoring and this central monitoring can be found in section 4.1 of the project report.
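
    As an illustration of this kind of enrichment (not the project's actual code), both Crossref and Unpaywall expose public REST endpoints keyed by DOI; a minimal sketch, with the contact email as a placeholder:

      import requests

      EMAIL = "you@example.org"  # placeholder contact address for the polite pools

      def enrich_doi(doi):
          """Fetch basic bibliographic and OA information for one DOI."""
          crossref = requests.get(
              f"https://api.crossref.org/works/{doi}",
              params={"mailto": EMAIL}, timeout=30,
          ).json()["message"]
          unpaywall = requests.get(
              f"https://api.unpaywall.org/v2/{doi}",
              params={"email": EMAIL}, timeout=30,
          ).json()
          return {
              "doi": doi,
              "title": crossref.get("title", [None])[0],
              "issn": crossref.get("ISSN"),
              "is_oa": unpaywall.get("is_oa"),
              "oa_status": unpaywall.get("oa_status"),
              "best_oa_host": (unpaywall.get("best_oa_location") or {}).get("host_type"),
          }

      # Example (hypothetical DOI): print(enrich_doi("10.1234/example"))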

  10. Data composition and characteristics of included studies.

    • figshare.com
    xls
    Updated Jun 14, 2023
    Cite
    Matthew G. Crowson; Dana Moukheiber; Aldo Robles Arévalo; Barbara D. Lam; Sreekar Mantena; Aakanksha Rana; Deborah Goss; David W. Bates; Leo Anthony Celi (2023). Data composition and characteristics of included studies. [Dataset]. http://doi.org/10.1371/journal.pdig.0000033.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS Digital Health
    Authors
    Matthew G. Crowson; Dana Moukheiber; Aldo Robles Arévalo; Barbara D. Lam; Sreekar Mantena; Aakanksha Rana; Deborah Goss; David W. Bates; Leo Anthony Celi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data composition and characteristics of included studies.

  11. Autonomous Navigation, Dynamic Path and Work Flow Planning in Multi-Agent...

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Autonomous Navigation, Dynamic Path and Work Flow Planning in Multi-Agent Robotic Swarms [Dataset]. https://data.nasa.gov/dataset/Autonomous-Navigation-Dynamic-Path-and-Work-Flow-P/y744-vwzx
    Explore at:
    Available download formats: json, csv, tsv, application/rssxml, xml, application/rdfxml
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works, https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Kennedy Space Center has teamed up with the Biological Computation Lab at the University of New Mexico to create a swarm of small, low-cost, autonomous robots, called Swarmies, to be used as a ground-based research platform for in-situ resource utilization missions. The behavior of the robot swarm mimics the central-place foraging strategy of ants to find and collect resources in an unknown environment and return those resources to a central site. The swarm has no prior knowledge of the environment, uses trails as a simple indirect communication strategy, and evolves a set of behavioral parameters using a genetic algorithm. The evolution of the parameters allows the swarm to maximize collection rates while adapting to new environmental conditions and to various unknown resource distributions. The goal of this research is to add new in-situ resource utilization related behaviors to the genetic algorithm and to increase the autonomy of the system.

    As humans push further beyond the grasp of earth, robotic missions in advance of human missions will play an increasingly important role. These robotic systems will find and collect valuable resources as part of an in-situ resource utilization strategy. They will need to be highly autonomous while maintaining high task performance levels.

    Kennedy Space Center has teamed up with the Biological Computation Lab at the University of New Mexico to create a swarm of small, low-cost, autonomous robots, called Swarmies, to be used as a ground-based research platform for in-situ resource utilization missions. The behavior of the robot swarm mimics the central-place foraging strategy of ants to find and collect resources in an unknown environment and return those resources to a central site. The swarm has no prior knowledge of the environment, uses trails as a simple indirect communication strategy, and evolves a set of behavioral parameters using a genetic algorithm. The evolution of the parameters allows the swarm to maximize collection rates while adapting to new environmental conditions and to various unknown resource distributions.

    The goal of this research is to add new in-situ resource utilization related behaviors to the genetic algorithm and to increase the autonomy of the system. Digital trails that are based on ant pheromone trails provide simple communication of resource locations between robots but they also provide some assistance with obstacle avoidance and navigation. Two newly evolved genetic algorithm parameters allow the robots to recharge their batteries autonomously without loss of robots due to insufficient charge. The genetic algorithm is able to optimize collection rates while also dealing with relatively high system error due to inexpensive and sub-optimal sensors onboard the robots. The distributed nature of the robotic swarm prevents a single point of failure and allows the system to operate even with the loss of one or more robots.

    Off-planet applications of such a system include water-ice detection and mining, terrain mapping and habitat construction in advance of human explorers. Terrestrial applications include search-and-rescue, hazardous waste cleanup, land mine removal and infrastructure inspection and repair. The approach used in the software could also be adapted to search the Internet or to search large unknown data sets.

    This project has helped demonstrate that in an obstacle laden environment, trails used as a simple indirect communication strategy can allow a swarm of small, low-cost robots to collect resources in an optimal manner when coupled with a genetic algorithm to evolve behaviors. This project has also helped demonstrate that an autonomous robot swarm can evolve battery charging behavior using a genetic algorithm to minimize or eliminate dead robots due to insufficient charge.

    The project has been successful in meeting the original goals in simulated field trials and also in real robots. The genetic algorithm is able to evolve optimal behaviors allowing for efficient resource collection while coping with various obstacle arrangements and resource distributions. The genetic algorithm is also able to maintain a high level of fitness while incorporating autonomous recharging of the robots. It has also been demonstrated that this system is error tolerant, adaptable to different robot types, and is scalable in both environment size and numbers of robots.

    All of these robot behaviors are performed in real-time with a small, low-cost onboard computer and small memory footprint. The robot platform is constructed using commercial off-the-shelf parts and 3D printed parts with a total cost of less than $1,500 per robot.

    A secondary goal of this project was to extend the genetic algorithm to other new and commercially available robot platforms. Another secondary goal was to use open source software frameworks to help reduce barriers and allow future researchers to more easily utilize genetic algorithms for behavior evolution in a swarm of robots. The project has been successful in meeting both of these secondary goals.

    This autonomous mobile robot system is a foundation for future research into the suitability of robot swarms and evolutionary algorithms for in-situ resource utilization missions. The low-cost of this type of system removes one of the barriers typically associated with swarm operation and research. Such research should prove valuable as humans explore harsh, remote, or inaccessible locations where teleoperation is required.

  12. Provisioning (Prov)

    • catalog.data.gov
    • datahub.va.gov
    • +4more
    Updated Apr 21, 2022
    Cite
    Department of Veterans Affairs (2022). Provisioning (Prov) [Dataset]. https://catalog.data.gov/dataset/provisioning-prov
    Explore at:
    Dataset updated
    Apr 21, 2022
    Dataset provided by
    United States Department of Veterans Affairs, http://va.gov/
    Description

    User provisioning is the process of associating a digital identity with one or more resource access accounts, which may serve as records for user data and permissions. This may include the creation, modification, deletion, suspension, or restoration of such accounts, as well as synchronizing user data. Managing user accounts locally creates redundancy in collecting and managing user information. This may lead to inaccuracies and inconsistencies with user data that is stored in authoritative sources, and it may also create security vulnerabilities by maintaining accounts for terminated users. By leveraging Provisioning, VA applications are able to centrally manage user information, attributes, roles, and accounts using data from authoritative identity sources. Once a manually intensive and disjointed process, VA's Provisioning (Prov) service leverages automated and centralized workflows to enhance security and bolster efficiency. As a result, application administrators, who previously had to manually manage user administration, now have services available to automate the process. Provisioning assigns the Security Identifier (SecID) as a unique user identifier for integrated applications to use for user authorization and audit. SecID is a unique ID assigned to a user when they are added to the Provisioning system via an on-boarding event. SecID, once assigned, remains the same even if the user's status with VA changes over time (e.g., a Veteran becomes a contractor and then later becomes an employee). SecID is the identifier used to correlate Provisioning records to MVI's integration control number (ICN), which is the unique person identifier. Provisioning automates the on-boarding and off-boarding flows. Applications leveraging Provisioning gain an added benefit of automated account removal (deprovisioning) during off-boarding. The following table lists the detailed functions offered by the Provisioning service. Additionally, the Provisioning service includes a Role Engineering and Compliance Tool (RECT) that can help applications:

    • Conduct Role Analysis: Provides the ability to analyze current roles and permissions to rapidly build and deploy an enterprise role model
    • Certify User Roles: Provides the ability to have access privileges reviewed and managed by designated reviewers

  13. Huddle-MyUSAID Document Workspaces

    • catalog.data.gov
    Updated Jun 25, 2024
    Cite
    data.usaid.gov (2024). Huddle-MyUSAID Document Workspaces [Dataset]. https://catalog.data.gov/dataset/huddle-myusaid-document-workspaces
    Explore at:
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    United States Agency for International Development, https://usaid.gov/
    Description

    Huddle is a collaboration platform that provides USAID with a secure environment where the Agency can manage, organize and work together on all of their content. Huddle workspaces are central repositories for all types of files to be saved and accessed by USAID offices around the globe. In workspaces, USAID users will manage project tasks, files and workflows. Huddle's file management features enable users to upload multiple files from their desktop, create a folder structure and share files with their team members through the platform. Users can share and comment on files, and direct the comments to specific team members. When edits to a file are required, users will open the file in its native application directly in the platform, make changes, and a new version will be automatically saved to the workspace. The editing feature provides users with all of the familiar features and functionality of the native application without leaving Huddle. Files are locked when they are opened for editing so there is no confusion about which user has made changes to a version. All content stored on Huddle has access permission settings so USAID can ensure that the right documents are visible and being shared with the appropriate users.

  14. CIROH: Enabling collaboration through data and model sharing with CUAHSI...

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Aug 26, 2024
    Cite
    David Tarboton; Jeffery S. Horsburgh; Shaowen Wang; Jordan Stuart Read; Irene Garousi-Nejad; Anthony M. Castronova; Clara Cogswell (2024). CIROH: Enabling collaboration through data and model sharing with CUAHSI HydroShare [Dataset]. http://doi.org/10.4211/hs.23b51cb99b5445a1af740a21c5acaecb
    Explore at:
    Available download formats: zip (2.0 MB)
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    HydroShare
    Authors
    David Tarboton; Jeffery S. Horsburgh; Shaowen Wang; Jordan Stuart Read; Irene Garousi-Nejad; Anthony M. Castronova; Clara Cogswell
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Collaboration is central to CIROH (https://ciroh.ua.edu/). Advancing the knowledge needed to support research to operations in hydrology depends on collaboration around model and data sharing. It requires open data supporting the integration of information from multiple sources; easy to use, generally accessible, shareable computing; and working together as a team and community. The CUAHSI HydroShare platform was developed to advance water research by enabling communities of researchers to more easily and freely share digital products resulting from their research, not just the scientific publications summarizing a study, but the data, models and workflows used to produce the results, consistent with Findable, Accessible, Interoperable, and Reusable (FAIR) principles of present-day research. HydroShare supports and enables private (e.g., social science) and open data sharing, transparent workflows, and computational reproducibility, thereby improving reliability and trust in research findings. These are crucial as research is transferred into operations.

    The goal of this project is to enhance the performance, reliability, usability, and scalability of HydroShare’s linkages with cloud storage and computational systems to fulfill CIROH’s community collaboration and linked computing needs and enable CIROH researchers to easily integrate and analyze national scale datasets required for their research using high-performance and cloud computing systems.

    The objectives are to (1) enhance community data access; (2) establish interoperability with scalable computing; (3) demonstrate computational reproducibility; and (4) establish and grow a CIROH Community on HydroShare. Work under objective (1) will use community input to identify, prioritize, and establish easy to use access to multiple high-value community datasets. Work under objective (2) will establish or extend interfaces to high performance computing, leveraging tools for model input preparation such as the CUAHSI Domain Subsetter and I-GUIDE (the Institute for Geospatial Understanding through an enhanced Discovery Environment, https://iguide.illinois.edu). Work under objective (3) will establish and document CIROH community best practices for enhancing the reproducibility of high-performance computing and analysis workflows so that CIROH modeling workflows can be accessed, re-executed, and analyzed by multiple researchers. Work under objective (4) will establish a CIROH “Community” within the HydroShare repository to support collaboration around and sharing of CIROH research products.

    Forecasting operations will benefit from the transparency of research products hosted in HydroShare and linked to computing platforms for reproducibility and evaluation. Linking publications, data, and code (often in GitHub), with methods and findings that are well documented and tested will support their evaluation by the National Water Center for operational adoption.

    This project runs 6/1/2023 to 5/31/2025.

  15. HAND in Hand platform - FAO catalog - Sites - CKAN Ecosystem Catalog

    • catalog.civicdataecosystem.org
    Updated May 5, 2025
    Cite
    (2025). HAND in Hand platform - FAO catalog - Sites - CKAN Ecosystem Catalog [Dataset]. https://catalog.civicdataecosystem.org/dataset/hand-in-hand-platform-fao-catalog
    Explore at:
    Dataset updated
    May 5, 2025
    Description

    This portal is the backbone of the FAO Agro-informatics Platform. The portal provides tools to manage collections of data with metadata published by units across multiple domains within FAO and outside, including partners, NGOs, the private sector and space agencies. This portal helps users discover, understand and effectively use data assets.

    The FAO Agro-informatics Platform is a transformative digital initiative designed to accelerate progress towards the Sustainable Development Goals (SDGs) by providing access to the latest knowledge and resources. By integrating geospatial and statistical data, the platform facilitates targeted agricultural interventions, converting data into actionable information for sustainable agriculture. This approach not only supports various SDGs, such as poverty eradication, climate action, and hunger elimination, but also enhances the mobilization of finance, science, and policy mechanisms. Notable initiatives like the Hand-in-Hand Initiative benefit from this platform, improving agrifood systems and resilience, especially for vulnerable rural communities and smallholders.

    The platform has brought together over 20 FAO units across multiple domains, from Animal Health to Trade and Markets, integrating data from across FAO on Soil, Land, Water, Climate, Fisheries, Livestock, Crops, Forestry, Trade, Social and Economics, etc. Data has also been sourced from FAO partners and public data providers across the UN and NGOs, the private sector and space agencies. So far, we have assembled a million geospatial layers and thousands of statistics series with 4,000 metadata records. Links to the documentation and videos for the Data Catalog Portal and the Agro-informatics Platform are available below. Support team email: [email protected]

    The FAO Agro-Informatics Data Catalog portal is a customized data management platform that allows you to view, create, manage, publish and share datasets. It serves as a central repository for open data and supports a variety of data formats and metadata used by the Agro-informatics Platform. Datasets are what users get from the catalog when searching for data, each of them identified by a unique fixed URL. A dataset is a parcel of data that includes information about the data, or "metadata", and the related resource, which holds the data itself. A dataset can contain multiple resources, representing the same data in different formats or for different years. Data can be added as an upload from your computer or provided as a link to external sources on the web.

    An organization, in the FAO Agro-Informatics Data Catalog, is used to create, manage, and publish collections of datasets and to manage users. Organizations can be considered as entities that publish datasets. These entities may be a research institution, either national or international, the public or private sector, among others. Each organization is responsible for its own datasets and users, and manages the dataset publication workflow. Users can have different roles within an organization, depending on their level of authorization to create, edit and publish information. Only system administrators can create a new organization in the FAO Agro-Informatics Data Catalog; reach out to the team to assist you.

    You can register from the FAO Data Catalog portal by clicking on 'Log-in' in the top-right corner of the homepage. The registration process requests you to provide your email address (in lowercase), your full name and a password. Once the registration is complete, the platform will create your account and automatically log you in. Note that registering to the platform does not automatically give you any special permission or authorization, which are assigned by the admin of an organization. To be part of one or more organizations, you first need to be a registered user and then make a request to the contact person as indicated on the about page of the organization. You will be assigned to the organization with a role that reflects the level of authorization you need.

    Most of the datasets published in the FAO Agro-Informatics Data Catalog are publicly available. However, registered users assigned to one or more organizations can have access to private or unpublished datasets and may be assigned roles to create, manage and publish new datasets within their own organization(s). To share your data through the FAO Data Catalog portal, you need to be a registered user and added to an organization as an Editor or Admin.

    Dataset and DCAT metadata models are mainly used to describe statistical data (tables and spreadsheets) or to report collections of datasets and groups of layers. The ISO-19115 metadata model is the most appropriate schema to describe spatial data, particularly for raster and vector formats, and to link any type of associated resources to the metadata generated in the catalog. See the documentation for more information.

    Most datasets can be downloaded directly from their dataset page under Data and Resources. Click on the download link next to the resource you wish to download. If a dataset is restricted, you may need appropriate permissions or approval to access it. Data usage is typically governed by the license under which it is published. Each dataset page includes information about its license. Make sure to review the license to understand any restrictions or requirements for using the data. If you have questions about licensing, contact the dataset owner or the support team.

    If you encounter a bug or issue, please contact the support team ([email protected]) with detailed information about the problem. Providing screenshots and error messages can help resolve the issue faster.

  16. Federated learning approach, topologies, and reproducibility of included...

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Matthew G. Crowson; Dana Moukheiber; Aldo Robles Arévalo; Barbara D. Lam; Sreekar Mantena; Aakanksha Rana; Deborah Goss; David W. Bates; Leo Anthony Celi (2023). Federated learning approach, topologies, and reproducibility of included studies. [Dataset]. http://doi.org/10.1371/journal.pdig.0000033.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS Digital Health
    Authors
    Matthew G. Crowson; Dana Moukheiber; Aldo Robles Arévalo; Barbara D. Lam; Sreekar Mantena; Aakanksha Rana; Deborah Goss; David W. Bates; Leo Anthony Celi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Workflow, topologies and computing plan classification adapted from Rieke et al. [1].

  17. Data from: Frugivoria: A trait database for birds and mammals exhibiting...

    • portal.edirepository.org
    • dataone.org
    bin, csv, png
    Updated Mar 21, 2023
    + more versions
    Cite
    Beth Gerstner; Phoebe Zarnetske; Patrick Bills (2023). Frugivoria: A trait database for birds and mammals exhibiting frugivory across contiguous Neotropical moist forests [Dataset]. http://doi.org/10.6073/pasta/168e95f04d4726d31d868bfe22d749a5
    Explore at:
    Available download formats: csv(37073 byte), csv(16108 byte), csv(260284 byte), csv(66396 byte), csv(554499 byte), png(230027 byte), csv(582260 byte), csv(36616 byte), csv(25813 byte), csv(13411 byte), csv(11296 byte), csv(422716 byte), bin(9593 byte), bin(6371 byte), bin(1014 byte), bin(18818 byte), bin(16446 byte), bin(13210 byte), bin(4644 byte), bin(2775 byte), csv(539474 byte), bin(1471 byte), bin(4829 byte), bin(7869 byte), bin(5305 byte), bin(8772 byte)
    Dataset updated
    Mar 21, 2023
    Dataset provided by
    EDI
    Authors
    Beth Gerstner; Phoebe Zarnetske; Patrick Bills
    Time period covered
    1924 - 2023
    Area covered
    Variables measured
    Unit, code, rank, Trait, genus, family, season, habitat, order_e, species, and 164 more
    Description

    Biodiversity in many areas is rapidly shifting and declining as a consequence of global change. As such, there is an urgent need for new tools and strategies to help identify, monitor, and conserve biodiversity hotspots. One way to identify these areas is by quantifying functional diversity, which measures the unique roles of species within a community and is valuable for conservation because of its relationship with ecosystem functioning. Unfortunately, the trait information required to evaluate functional diversity is often lacking and is difficult to harmonize across disparate data sources. Biodiversity hotspots are particularly lacking in this information. To address this knowledge gap, we compiled Frugivoria, a trait database containing dietary, life-history, morphological, and geographic traits, for mammals and birds exhibiting frugivory, which are important for seed dispersal, an essential ecosystem service. Accompanying Frugivoria is an open workflow that harmonizes trait and taxonomic data from disparate sources and enables users to analyze traits in space. This version of Frugivoria contains mammal and bird species found in contiguous moist montane forests and adjacent moist lowland forests of Central and South America– the latter specifically focusing on the Andean states. In total, Frugivoria includes 45,216 unique trait values, including new values and harmonized values from existing databases. Frugivoria adds 23,707 new trait values (8,709 for mammals and 14,999 for birds) for a total of 1,733 bird and mammal species. These traits include diet breadth, habitat breadth, habitat specialization, body size, sexual dimorphism, and range-based geographic traits including range size, average annual mean temperature and precipitation, and metrics of human impact calculated over the range. Frugivoria fills gaps in trait categories from other databases such as diet category, home range size, generation time, and longevity, and extends certain traits, once only available for mammals, to birds. In addition, Frugivoria adds newly described species not included in other databases and harmonizes species classifications among databases. Frugivoria and its workflow enable researchers to quantify relationships between traits and the environment, as well as spatial trends in functional diversity, contributing to basic knowledge and applied conservation of frugivores in this region. By harmonizing trait information from disparate sources and providing code to access species occurrence data, this open-access database fills a major knowledge gap and enables more comprehensive trait-based studies of species exhibiting frugivory in this ecologically important region.

  18. A workflow for selecting seed densities in desert species experiments: pilot...

    • dataone.org
    • knb.ecoinformatics.org
    Updated Apr 10, 2021
    Cite
    Christopher Lortie; Mario Zuliani; Nargol Ghazian (2021). A workflow for selecting seed densities in desert species experiments: pilot test for Central California Deserts [Dataset]. http://doi.org/10.5063/QF8R85
    Explore at:
    Dataset updated
    Apr 10, 2021
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Christopher Lortie; Mario Zuliani; Nargol Ghazian
    Time period covered
    Jan 1, 2021
    Area covered
    Description

    A data-check of the reported seed densities for Central California species (native and exotic) in experimentation.

  19. Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping

    • figshare.com
    Updated Jan 6, 2025
    Cite
    Maryam Binti Haji Abdul Halim (2025). Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping [Dataset]. http://doi.org/10.6084/m9.figshare.28147451.v1
    Explore at:
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    figshare
    Authors
    Maryam Binti Haji Abdul Halim
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.

    Key Features and Tools:

    • Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
    • Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
    • Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
    • Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.

  20. Data from: Fragment Shuffling: An Automated Workflow for Three-Dimensional...

    • figshare.com
    • acs.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Britta Nisius; Ulrich Rester (2023). Fragment Shuffling: An Automated Workflow for Three-Dimensional Fragment-Based Ligand Design [Dataset]. http://doi.org/10.1021/ci8004572.s007
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Britta Nisius; Ulrich Rester
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Fragment-based approaches represent a promising alternative in lead discovery. Herein, we present the automated fragment shuffling workflow for the identification of novel lead compounds, combining central elements from fragment-based lead identification and structure-based de novo design. Our method is based on sets of aligned 3D ligand structures binding to the same target or target family. The implementation comprises three different ligand fragmentation methods, a scoring scheme assigning individual scores to each fragment, and the incremental construction of novel ligands based on a greedy search algorithm guided by the calculated fragment scores. The validation of our 3D ligand design workflow is presented on the basis of two pharmaceutically relevant drug targets. A retrospective study based on a selected protein kinase data set revealed that the fragment shuffling approach yields results that extend beyond those of the well-known BREED technique. Furthermore, we applied our approach in a prospective study for the design of novel non-peptidic thrombin inhibitors. The designed ligand structures in both studies demonstrate the potential of the fragment shuffling workflow.
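    The greedy, score-guided construction step described above can be illustrated with a small toy sketch. This is not the published implementation; the fragment objects, scores, and compatibility check below are entirely hypothetical stand-ins for the 3D chemistry the workflow actually performs.

        # Toy illustration of greedy, score-guided fragment assembly (hypothetical data; not the authors' code).
        from dataclasses import dataclass

        @dataclass
        class Fragment:
            name: str
            score: float  # stands in for the individual fragment score from the (assumed) scoring scheme

        def compatible(partial, candidate):
            # Placeholder for the geometric/connectivity check between the growing ligand and a fragment.
            return candidate not in partial

        def greedy_assemble(fragments, max_fragments=4):
            # Rank fragments by score, then repeatedly add the best-scoring fragment
            # that is still compatible with the partial structure.
            ligand = []
            for frag in sorted(fragments, key=lambda f: f.score, reverse=True):
                if len(ligand) >= max_fragments:
                    break
                if compatible(ligand, frag):
                    ligand.append(frag)
            return ligand

        pool = [Fragment("amide-linker", 0.7), Fragment("benzamidine", 0.9), Fragment("piperidine", 0.5)]
        print([f.name for f in greedy_assemble(pool)])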

Cite
Sheeba Samuel; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725

Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip, pdf
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sheeba Samuel; Daniel Mietchen
License

CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. We evaluated the extent of reproducibility of Jupyter notebooks derived from GitHub repositories linked to publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

Data Collection and Analysis

We reuse the code for assessing the reproducibility of Jupyter notebooks from the study by Pimentel et al. (2019) and adapt code from ReproduceMeGit. We also provide code for collecting the publication metadata from PubMed Central using the NCBI Entrez utilities via Biopython.

Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve the results in XML format, capturing essential details about journals and articles. By systematically scanning each article, including the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine the repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Using the GitHub API, we enrich our data with repository creation dates, update and push histories, and programming languages.
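For illustration, the PMC search described above can be sketched with Biopython's Entrez utilities. This is a minimal sketch, not the study's pipeline; the e-mail address and retmax value are placeholders, and paging and error handling are omitted.

    # Minimal sketch of querying PubMed Central for notebook-related articles (illustrative only).
    from Bio import Entrez

    Entrez.email = "your.name@example.org"  # NCBI requires a contact address; placeholder value

    # Search PMC with the query used in the study.
    handle = Entrez.esearch(db="pmc", term="(ipynb OR jupyter OR ipython) AND github", retmax=100)
    record = Entrez.read(handle)
    handle.close()

    # Fetch the matching article records as XML for downstream GitHub-link extraction.
    fetch = Entrez.efetch(db="pmc", id=",".join(record["IdList"]), retmode="xml")
    xml_data = fetch.read()
    fetch.close()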

All the extracted information is stored in a SQLite database. After collecting the data and creating the database tables, we ran a pipeline, based on the code from Pimentel et al. (2019), to collect the Jupyter notebooks contained in the GitHub repositories.

Our reproducibility pipeline was started on 27 March 2023.

Repository Structure

Our repository is organized into two main folders:

  • archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. The pipeline creates 24 database tables that store information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.
  • analyses: Here you will find the notebooks used for the in-depth analysis of the data collected in our study. The db.sqlite file generated by running the archaeology pipeline is stored in the analyses folder for further analysis; the path can, however, be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) examines data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) analyzes data associated with publications in PubMed Central, i.e. plots involving data about articles, journals, publication dates, or research fields. The resulting figures from these notebooks are stored in the 'outputs' folder.
  • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.

Accessing Data and Resources:

  • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
  • For the latest results and re-run data, refer to this link.
  • The comprehensive SQLite database that encapsulates all the data extracted in the study is stored in the db.sqlite file (a quick way to inspect it is sketched below).
  • The metadata in XML format extracted from PubMed Central, which contains the information about the articles and journals, can be accessed in the pmc.xml file.
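As a quick orientation, the database can be inspected with Python's built-in sqlite3 module. The sketch below only assumes the table names mentioned in the repository structure (repositories, notebooks, cells, executions); the path and exact schema should be checked against the actual file.

    # Minimal sketch for inspecting db.sqlite (path and table names assumed from the description above).
    import sqlite3

    conn = sqlite3.connect("analyses/db.sqlite")
    cur = conn.cursor()

    # List all tables to discover the actual schema (24 tables are described above).
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
    print([row[0] for row in cur.fetchall()])

    # Row counts for a few tables named in the repository structure.
    for table in ("repositories", "notebooks", "cells", "executions"):
        try:
            cur.execute(f"SELECT COUNT(*) FROM {table}")
            print(table, cur.fetchone()[0])
        except sqlite3.OperationalError:
            print(table, "not found under this name")

    conn.close()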

System Requirements:

Running the pipeline:

  • Clone the computational-reproducibility-pmc repository using Git:
    git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
  • Navigate to the computational-reproducibility-pmc directory:
    cd computational-reproducibility-pmc/computational-reproducibility-pmc
  • Configure environment variables in the config.py file:
    GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
    GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
  • Other environment variables can also be set in the config.py file; alternatively, they can be supplied through the environment, as shown in the sketch after this list.
    BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
    DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
  • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
    source conda-setup.sh
  • Change to the archaeology directory:
    cd archaeology
  • Activate the conda environment (we used py36 to run the pipeline):
    conda activate py36
  • Execute the main pipeline script (r0_main.py):
    python r0_main.py
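As an alternative to editing config.py, the same settings can be supplied through the environment variables it reads. The following is a minimal sketch under that assumption; the values are placeholders, and the appropriate conda environment (py36) is assumed to be active.

    # Minimal sketch: provide the settings config.py reads, then launch the pipeline
    # (values are placeholders; run from the computational-reproducibility-pmc directory
    # with the py36 conda environment active).
    import os
    import subprocess

    os.environ["JUP_GITHUB_USERNAME"] = "your-github-username"
    os.environ["JUP_GITHUB_PASSWORD"] = "your-github-token"
    os.environ["JUP_BASE_DIR"] = "/data/github-repositories"   # where cloned repositories are stored
    os.environ["JUP_DB_CONNECTION"] = "sqlite:///db.sqlite"    # database connection string

    # Run the main pipeline script inside the archaeology directory.
    subprocess.run(["python", "r0_main.py"], cwd="archaeology", check=True)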

Running the analysis:

  • Navigate to the analyses directory:
    cd analyses
  • Activate the conda environment (we used raw38 for the analysis of the metadata collected in the study):
    conda activate raw38
  • Install the required packages using the requirements.txt file:
    pip install -r requirements.txt
  • Launch JupyterLab:
    jupyter lab
  • Refer to the Index.ipynb notebook for the execution order and guidance.

References:
