15 datasets found

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

zenodo.org

application/gzip, bin +2

Updated Aug 2, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Explore at:

bin, application/gzip, zip, text/x-pythonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup
the process:

####Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

####Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `

Z
Dataset for Assessing Multi-Dimensional Impacts of Achieving Sustainability...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Mar 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kyle, Page; Ollenburger, Mary; Zhang, Xin; Niazi, Hassan; Durga, Siddarth; Ou, Yang (2024). Dataset for Assessing Multi-Dimensional Impacts of Achieving Sustainability Goals by Projecting the Sustainable Agriculture Matrix into the Future [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7251098
Explore at:
Dataset updated
Mar 17, 2024
Dataset provided by
Joint Global Change Research Institute, Pacific Northwest National Laboratory (JGCRI-PNNL), College Park, MD, USA
Appalachian Laboratory, University of Maryland Center for Environmental Science (UMCES), Frostburg, MD, USA
Authors
Kyle, Page; Ollenburger, Mary; Zhang, Xin; Niazi, Hassan; Durga, Siddarth; Ou, Yang
License
https://opensource.org/licenses/BSD-2-Clausehttps://opensource.org/licenses/BSD-2-Clause
Description
This data repository feeds into the meta-repository setup for post-processing of GCAM-SAM outputs. GitHub link of meta-repository is: https://github.com/JGCRI/Kyle-etal_2022_EF

Folders: model/ is the static version of the model used to simulate 8 scenarios. See the GitHub GCAM-SAM repository to follow active development of this model. inputs/ folder contains input datasets and scripts used to prepare files while postprocessing. This is to be used with GitHub post-processing meta-repository. outdata/ contains GCAM-SAM output and post-processed output files used to plot figures.

Key files: SAM-matrix.dat is the consolidated GCAM-SAM output. Use proj_load.R in the metarepo to read the file. region_vals.csv has all 8 indicators in all 8 scenarios for years 2020 till 2100 on a 10 year time step.

Short introduction to the study:

In this paper sustainable agriculture matrix (SAM) is estimated to 2100 using Global Change Analysis Model (GCAM). We model combinatorial variations of yield intensification, dietary shift, and greenhouse gas mitigation scenarios. Findings include scenarios having significant tradeoffs across multiple environmental, economic, and social dimensions. Assessment of these multi-dimensional tradeoffs in a consistent framework improves the quality of information for decision-making.

Should you have any questions, feel free to reach out Page Kyle at pkyle@pnnl.gov.
Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...
data.niaid.nih.gov
zenodo.org
+1more
zip
Updated Dec 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.w3r2280w0
Dataset updated
Dec 7, 2023
Dataset provided by
National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
HIV Prevention Trials Networkhttp://www.hptn.org/
HIV Vaccine Trials Networkhttp://www.hvtn.org/
PEPFAR
Authors
Dylan Westfall; Mullins James
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.

Dollar-Rial-Toman Live Price Dataset

kaggle.com

zip

Updated Nov 7, 2025

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Koorosh Komeilizadeh (2025). Dollar-Rial-Toman Live Price Dataset [Dataset]. https://www.kaggle.com/datasets/kooroshkz/dollar-rial-toman-live-price-dataset

Explore at:

zip(66708 bytes)Available download formats

Dataset updated

Nov 7, 2025

Authors

Koorosh Komeilizadeh

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Dollar-Rial-Toman Live Price Dataset

A comprehensive, daily-updated dataset of US Dollar to Iranian Rial exchange rates (USD/IRR) with historical data from November 2011 to present. This dataset is ideal for financial analysis, economic research, forecasting, and machine learning projects.

Dataset Overview

Time Period: November 26, 2011 - Present (continuously updated)
Total Records: 3,648+ daily price points
Data Source: TGJU.org (Tehran Gold & Jewelry Union)
Update Frequency: Daily (automated via GitHub Actions)
Format: CSV with proper date formatting and integer price structure

Data Structure

The CSV file contains the following columns:

Column	Description	Format	Example
Open Price	Opening price of the day	Integer	1012100
Low Price	Lowest price of the day	Integer	1011700
High Price	Highest price of the day	Integer	1034100
Close Price	Closing price of the day	Integer	1029800
Change Amount	Price change amount	String	15400
Change Percent	Price change percentage	String	1.52%
Gregorian Date	Gregorian date	YYYY/MM/DD	2025/09/06
Persian Date	Persian/Shamsi date	YYYY/MM/DD	1404/06/15

Download the Data

Dataset on Kaggle: Kaggle: kooroshkz/Dollar-Rial-Toman-Live-Price-Dataset
Dataset on Github: GitHub: kooroshkz/Dollar-Rial-Toman-Live-Price-Dataset

View Scraper and workflow source on GitHub

This live dataset, scraper source code and workflow is available on GitHub where you can explore, download, and use it directly.

Documentation & Charts

"https://kooroshkz.github.io/Dollar-Rial-Toman-Live-Price-Dataset/" target="_blank"> https://raw.githubusercontent.com/kooroshkz/Dollar-Rial-Toman-Live-Price-Dataset/main/assets/img/IntractiveChart.png">

Interactive charts and dataset overview are available at:
kooroshkz.github.io/Dollar-Rial-Toman-Live-Price-Dataset

Loading in Python

import pandas as pd

# Load dataset
df = pd.read_csv('data/Dollar_Rial_Price_Dataset.csv')

# Convert date column to datetime
df['Gregorian Date'] = pd.to_datetime(df['Gregorian Date'], format='%Y/%m/%d')

# Price columns are already integers
price_columns = ['Open Price', 'Low Price', 'High Price', 'Close Price']
print(df[price_columns].dtypes) # All should be int64

Direct Load in Python

# pip install kagglehub[hf-datasets]
import kagglehub

df = kagglehub.load_dataset(
  "kooroshkz/dollar-rial-toman-live-price-dataset",
  adapter="huggingface",
  file_path="Dollar_Rial_Price_Dataset.csv",
  pandas_kwargs={"parse_dates": ["Gregorian Date"]}
)

print(df.head())

Loading in R

# Load dataset
data <- read.csv("data/Dollar_Rial_Price_Dataset.csv", stringsAsFactors = FALSE)

# Convert date column
data$Gregorian.Date <- as.Date(data$Gregorian.Date, format = "%Y/%m/%d")

# View structure
str(data)

Data Quality & Updates

Validation: All price data undergoes validation checks for accuracy
Automated Updates: Dataset is automatically updated daily at 8:00 AM UTC
Data Integrity: Built-in duplicate prevention and format validation
Historical Consistency: Maintains consistent formatting across all time periods
Integer Prices: All price values stored as integers for precise calculations

Technical Implementation

This dataset is maintained using an automated web scraping system that:

Monitors TGJU.org for new exchange rate data
Validates and processes new records
Maintains data consistency and prevents duplicates
Automatically commits updates to the repository

Contributing

If you find data inconsistencies or have suggestions for improvements, please open an issue in the GitHub repository.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this dataset in your research or projects, please cite:

Dollar-Rial-Toman Live Price Dataset
Author: Koorosh Komeili Zadeh
Source: https://github.com/kooroshkz/Dollar-Rial-Toman-Live-Price-Dataset
Data Source: TGJU.org (Tehran Gold & Jewelry Union)
Date Range: November 2011 - Present

Keywords

USD to Rial dataset, Dollar to Toman dataset, Iran exchange rate CSV, USD/IRR daily price, foreign exchange Iran dataset, TGJU data, time series currency dataset

Disclaimer

This ...

The North Pacific Eukaryotic Gene Catalog: clustered nucleotide...
zenodo.org
application/gzip
Updated Jan 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ryan Groussman; Ryan Groussman; Sacha Coesel; Sacha Coesel; E. Virginia Armbrust; E. Virginia Armbrust (2025). The North Pacific Eukaryotic Gene Catalog: clustered nucleotide metatranscripts and read counts [Dataset]. http://doi.org/10.5281/zenodo.13826820
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13826820
Dataset updated
Jan 22, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Ryan Groussman; Ryan Groussman; Sacha Coesel; Sacha Coesel; E. Virginia Armbrust; E. Virginia Armbrust
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data continues with the development of the NPEGC Trinity de novo metatranscriptome assemblies from the protein data repository of The North Pacific Eukaryotic Gene Catalog. The nucleotide sequences corresponding to the NPEGC cluster representatives are collected together in these repository files:

NPac.G1PA.bf100.id99.nt.fasta.gz
NPac.G2PA.bf100.id99.nt.fasta.gz
NPac.G3PA.bf100.id99.nt.fasta.gz
NPac.G3PA_diel.bf100.id99.nt.fasta.gz
NPac.D1PA.bf100.id99.nt.fasta.gz

A full description of this data is published in Scientific Data, available here: The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Please cite this publication if your research uses this data:

Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J., & Armbrust, E. V. (2024). The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Scientific Data, 11(1), 1161.

These nucleotide sequences have been sourced from the Zenodo repository for raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3

Key processing steps are sampled below with links to the detailed code on the main github code repository: https://github.com/armbrustlab/NPac_euk_gene_catalog

Code used to build the kallisto indices and map the short reads against indices with kallisto are online in the code repository here: NPEGC.nt_kallisto_counts.sh

There are two main steps:
1. Generate the kallisto index on the sets of clustered nucleotide metatranscripts
2. Map the short reads from environmental samples back to the assembly index

As generated above, kallisto generates separate results files for each of the sample files. Even after compression, the total size of the tarballed kallisto output results directories are prohibitively large (>50GB). We use the code in this template R script to join together the 'est_count' estimated count values for the tens of millions of protein sequences in each project metatranscriptome, along with length.

The code in this template script was used for each project: aggregate_kallisto_counts.R
The output count files for each project are Gzip-compressed and uploaded to the NPEGC nucleotide data repository here:

G1PA.raw.est_counts.csv.gz
G2PA.raw.est_counts.csv.gz
G3PA.raw.est_counts.csv.gz
G3PA_diel.raw.est_counts.csv.gz
D1PA.raw.est_counts.csv.gz
n
Data from: Generalizable EHR-R-REDCap pipeline for a national...
data.niaid.nih.gov
datadryad.org
zip
Updated Jan 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.rjdfn2zcm
Dataset updated
Jan 9, 2022
Dataset provided by
Harvard Medical School
Massachusetts General Hospital
Authors
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

Methods eLAB Development and Source Code (R statistical software):

eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown ((https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.

Data Dictionary (DD)

EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

Study Cohort

This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

Statistical Analysis

OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
d
Data from: Reference transcriptomics of porcine peripheral immune cells...
catalog.data.gov
agdatacommons.nal.usda.gov
+2more
Updated Jun 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://catalog.data.gov/dataset/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
Explore at:
Dataset updated
Jun 5, 2025
Dataset provided by
Agricultural Research Service
Description
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464). Resources in this dataset:Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zipResource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gx) gene names (features.tsv.gz) cell IDs (barcodes.tsv.gz) *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csvResource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell nFeature_RNA = the number of genes detected in a cell Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells prcntMito = percent mitochondrial reads in a cell Scrublet = doublet probability score assigned to a cell seurat_clusters = cluster ID assigned to a cell PaperIDs = sample ID for a cell celltypes = cell type ID assigned to a cellResource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csvResource Description: .csv file containing first 100 PCA coordinates for cells. Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for all cells.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for all cells.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txtResource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tarResource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
Z
Food and Agriculture Biomass Input–Output (FABIO) database
data.niaid.nih.gov
zenodo.org
+1more
Updated Jun 8, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bruckner, Martin; Kuschnig, Nikolas (2022). Food and Agriculture Biomass Input–Output (FABIO) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2577066
Explore at:
Dataset updated
Jun 8, 2022
Dataset provided by
Vienna University of Economics and Business
Authors
Bruckner, Martin; Kuschnig, Nikolas
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This data repository provides the Food and Agriculture Biomass Input Output (FABIO) database, a global set of multi-regional physical supply-use and input-output tables covering global agriculture and forestry.

The work is based on mostly freely available data from FAOSTAT, IEA, EIA, and UN Comtrade/BACI. FABIO currently covers 191 countries + RoW, 118 processes and 125 commodities (raw and processed agricultural and food products) for 1986-2013. All R codes and auxilliary data are available on GitHub. For more information please refer to https://fabio.fineprint.global.

The database consists of the following main components, in compressed .rds format:

Z: the inter-commodity input-output matrix, displaying the relationships of intermediate use of each commodity in the production of each commodity, in physical units (tons). The matrix has 24000 rows and columns (125 commodities x 192 regions), and is available in two versions, based on the method to allocate inputs to outputs in production processes: Z_mass (mass allocation) and Z_value (value allocation). Note that the row sums of the Z matrix (= total intermediate use by commodity) are identical in both versions.

Y: the final demand matrix, denoting the consumption of all 24000 commodities by destination country and final use category. There are six final use categories (yielding 192 x 6 = 1152 columns): 1) food use, 2) other use (non-food), 3) losses, 4) stock addition, 5) balancing, and 6) unspecified.

X: the total output vector of all 24000 commodities. Total output is equal to the sum of intermediate and final use by commodity.

L: the Leontief inverse, computed as (I – A)-1, where A is the matrix of input coefficients derived from Z and x. Again, there are two versions, depending on the underlying version of Z (L_mass and L_value).

E: environmental extensions for each of the 24000 commodities, including four resource categories: 1) primary biomass extraction (in tons), 2) land use (in hectares), 3) blue water use (in m3)., and 4) green water use (in m3).

mr_sup_mass/mr_sup_value: For each allocation method (mass/value), the supply table gives the physical supply quantity of each commodity by producing process, with processes in the rows (118 processes x 192 regions = 22656 rows) and commodities in columns (24000 columns).

mr_use: the use table capture the quantities of each commodity (rows) used as an input in each process (columns).

A description of the included countries and commodities (i.e. the rows and columns of the Z matrix) can be found in the auxiliary file io_codes.csv. Separate lists of the country sample (including ISO3 codes and continental grouping) and commodities (including moisture content) are given in the files regions.csv and items.csv, respectively. For information on the individual processes, see auxiliary file su_codes.csv. RDS files can be opened in R. Information on how to read these files can be obtained here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

Except of X.rds, which contains a matrix, all variables are organized as lists, where each element contains a sparse matrix. Please note that values are always given in physical units, i.e. tonnes or head, as specified in items.csv. The suffixes value and mass only indicate the form of allocation chosen for the construction of the symmetric IO tables (for more details see Bruckner et al. 2019). Product, process and country classifications can be found in the file fabio_classifications.xlsx.

Footprint results are not contained in the database but can be calculated, e.g. by using this script: https://github.com/martinbruckner/fabio_comparison/blob/master/R/fabio_footprints.R

How to cite:

To cite FABIO work please refer to this paper:

Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., Börner, J. 2019. FABIO – The Construction of the Food and Agriculture Input–Output Model. Environmental Science & Technology 53(19), 11302–11312. DOI: 10.1021/acs.est.9b03554

License:

This data repository is distributed under the CC BY-NC-SA 4.0 License. You are free to share and adapt the material for non-commercial purposes using proper citation. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. In case you are interested in a collaboration, I am happy to receive enquiries at martin.bruckner@wu.ac.at.

Known issues:

The underlying FAO data have been manipulated to the minimum extent necessary. Data filling and supply-use balancing, yet, required some adaptations. These are documented in the code and are also reflected in the balancing item in the final demand matrices. For a proper use of the database, I recommend to distribute the balancing item over all other uses proportionally and to do analyses with and without balancing to illustrate uncertainties.
Z
Data from: Species Portfolio Effects Dominate Seasonal Zooplankton...
data.niaid.nih.gov
Updated Mar 16, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
O'Connor, Reilly (2022). Species Portfolio Effects Dominate Seasonal Zooplankton Stabilization Within a Large Temperate Lake [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_6345004
Explore at:
Dataset updated
Mar 16, 2022
Dataset provided by
University of Guelph
Authors
O'Connor, Reilly
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The raw data file is available online for public access (https://data.ontario.ca/dataset/lake-simcoe-monitoring). Download the 1980-2019 csv files and open up the file named "Simcoe_Zooplankton&Bythotrephes.csv". Copy and paste the zooplankton sheet into a new excel file called "Simcoe_Zooplankton.csv". The column ZDATE in the excel file needs to be switched from GENERAL to SHORT DATE so that the dates in the ZDATE column read "YYYY/MM/DD". Save as .csv in appropriate R folder. The data file "simcoe_manual_subset_weeks_5" is the raw data that has been subset for the main analysis of the article using the .R file "Simcoe MS - 5 Station Subset Data". The .csv file produced from this must then be manually edited to remove data points that do not have 5 stations per sampling period as well as by combining data points that should fall into a single week. The "simcoe_manual_subset_weeks_5.csv" is then used for the calculation of variability, stabilization, asynchrony, and Shannon Diversity for each year in the .R file "Simcoe MS - 5 Station Calculations". The final .R file "Simcoe MS - 5 Station Analysis contains the final statistical analyses as well as code to reproduce the original figures. Data and code for main and supplementary analyses are also available on GitHub (https://github.com/reillyoc/ZPseasonalPEs).
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...
zenodo.org
data.niaid.nih.gov
bin, csv, zip
Updated Dec 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
Explore at:
bin, zip, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6965147
Dataset updated
Dec 24, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

Background

This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

Usage

The data is licensed through the Creative Commons Attribution 4.0 International.

If you have used our data and are publishing your work, we ask that you please reference both:

this database through its DOI, and

any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

Included Files

Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data

Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory similar to the "Clean_Data"

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Clean_Data_v1-0-0.zip: contains all the downsampled data

The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

Database_References_v1-0-0.bib

Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

File Format: Downsampled Data

These are the "LP_

The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

Time[s]: time in seconds since the start of the test

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: the surface temperature in degC

These data files can be easily loaded using the pandas library in Python through:

import pandas data = pandas.read_csv(data_file, index_col=0)

The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

File Format: Unreduced Data

These are the "LP_

The first column is the index of each data point

S/No: sample number recorded by the DAQ

System Date: Date and time of sample

Time[s]: time in seconds since the start of the test

C_1_Force[kN]: load cell force

C_1_Déform1[mm]: extensometer displacement

C_1_Déplacement[mm]: cross-head displacement

Eng_Stress[MPa]: engineering stress

Eng_Strain[]: engineering strain

e_true: true strain

Sigma_true: true stress in MPa

(optional) Temperature[C]: specimen surface temperature in degC

The data can be loaded and used similarly to the downsampled data.

File Format: Overall_Summary

The overall summary file provides data on all the test specimens in the database. The columns include:

hidden_index: internal reference ID

grade: material grade

spec: specifications for the material

source: base material for the test specimen

id: internal name for the specimen

lp: load protocol

size: type of specimen (M8, M12, M20)

gage_length_mm_: unreduced section length in mm

avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

fy_n_mpa_: nominal yield stress

fu_n_mpa_: nominal ultimate stress

t_a_deg_c_: ambient temperature in degC

date: date of test

investigator: person(s) who conducted the test

location: laboratory where test was conducted

machine: setup used to conduct test

pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

citekey: reference corresponding to the Database_References.bib file

yield_stress_mpa_: computed yield stress in MPa

elastic_modulus_mpa_: computed elastic modulus in MPa

fracture_strain: computed average true strain across the fracture surface

c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

file: file name of corresponding clean (downsampled) stress-strain data

File Format: Summarized_Mechanical_Props_Campaign

Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')

citekey: reference in "Campaign_References.bib".

Grade: material grade.

Spec.: specifications (e.g., J2+N).

Yield Stress [MPa]: initial yield stress in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Elastic Modulus [MPa]: initial elastic modulus in MPa

size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

Caveats

The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

A500

A992_Gr50

BCP325

BCR295

HYP400

S460NL

S690QL/25mm

S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
Z
Newcastle polysomnography and accelerometer data
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vincent van Hees; Sarah Charman; Kirstie Anderson (2020). Newcastle polysomnography and accelerometer data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1160409
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Netherlands eScience Center, Amsterdam
Newcastle University, Newcastle
5. Regional Sleep Service, Freeman Hospital, Newcastle
Authors
Vincent van Hees; Sarah Charman; Kirstie Anderson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Newcastle PSG+Accelerometer study 2015

This data set contains 55 .bin files, 28 .txt files, and one .csv file, which were collected in Newcastle upon Tyne (UK) to evaluate an accelerometer-based algorithm for sleep classification. The data come form a a single night polysomnography recording in 28 sleep clinic patients. A description of the experimental protocol can be found in this open access PLoSONE paper from 2015: https://doi.org/10.1371/journal.pone.0142533.

Polysomnography

Sleep scores derived from polysomnography are stored in the .txt files. Each file represents a time series (one night) of one participant. The resolution of the scoring is 30 seconds. Participants are numbered. The participant number is included in the file names as “mecsleep01_...”. pariticpants_info.csv is a dictionary of participant number, diagnosis, age, and sex.

Accelerometer data

Accelerometer data from brand GENEActiv (https://www.activinsights.com) are stored in .bin files. Per participant two accelerometers were used: One accelerometer on each wrist (left and right). The right wrist from participant 10 is missing, hence the total number of 55 bin files. The tri-axial (three axis) accelerometers were configured to record at 85.7 Hertz. The accelerometer data can be read with R package GENEAread https://cran.r-project.org/web/packages/GENEAread/index.html. Additional information on the accelerometer can be found on the manufacturers product website: https://www.activinsights.com/resources-support/geneactiv/downloads-software/, including a description of the binary file structure on page 27 of this (pdf) file: https://49wvycy00mv416l561vrj345-wpengine.netdna-ssl.com/wp-content/uploads/2014/03/geneactiv_instruction_manual_v1.2.pdf. The participant number and the body side on which the accelerometer is worn are included in the file names as “MECSLEEP01_left wrist...”.

Participant information

The .csv file as included in this dataset contains a dictionary of the participant numbers, sleep disorder diagnosis, participant age at the time of measurement, and sex.

Example processing

The code we used ourselves to process this data can be found in this GitHub repository: https://github.com/wadpac/psg-ncl-acc-spt-detection-eval. Note that we use R package GGIR: https://cran.r-project.org/web/packages/GGIR/, which calls R package GENEAread for reading the binary data.

CODE-test: An annotated 12-lead ECG dataset

zenodo.org
data.niaid.nih.gov

zip

Updated Jun 7, 2021

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Antonio H Ribeiro; Antonio H Ribeiro; Manoel Horta Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Gabriela M. Paixão; Derick M. Oliveira; Derick M. Oliveira; Paulo R. Gomes; Paulo R. Gomes; Jéssica A. Canazart; Jéssica A. Canazart; Milton P. Ferreira; Milton P. Ferreira; Carl R. Andersson; Carl R. Andersson; Peter W. Macfarlane; Peter W. Macfarlane; Wagner Meira Jr.; Wagner Meira Jr.; Thomas B. Schön; Thomas B. Schön; Antonio Luiz P. Ribeiro; Antonio Luiz P. Ribeiro (2021). CODE-test: An annotated 12-lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3765780

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.3765780

Dataset updated

Jun 7, 2021

Dataset provided by

Zenodohttp://zenodo.org/

Authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

# Annotated 12 lead ECG dataset

Contain 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students. It is used as test set on the paper: "Automatic diagnosis of the 12-lead ECG using a deep neural network". https://www.nature.com/articles/s41467-020-15432-4.

It contain annotations about 6 different ECGs abnormalities:
- 1st degree AV block (1dAVb);
- right bundle branch block (RBBB);
- left bundle branch block (LBBB);
- sinus bradycardia (SB);
- atrial fibrillation (AF); and,
- sinus tachycardia (ST).

Companion python scripts are available in:
https://github.com/antonior92/automatic-ecg-diagnosis

--------

Citation
```
Ribeiro, A.H., Ribeiro, M.H., Paixão, G.M.M. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 11, 1760 (2020). https://doi.org/10.1038/s41467-020-15432-4
```

Bibtex:
```
@article{ribeiro_automatic_2020,
 title = {Automatic Diagnosis of the 12-Lead {{ECG}} Using a Deep Neural Network},
 author = {Ribeiro, Ant{\^o}nio H. and Ribeiro, Manoel Horta and Paix{\~a}o, Gabriela M. M. and Oliveira, Derick M. and Gomes, Paulo R. and Canazart, J{\'e}ssica A. and Ferreira, Milton P. S. and Andersson, Carl R. and Macfarlane, Peter W. and Meira Jr., Wagner and Sch{\"o}n, Thomas B. and Ribeiro, Antonio Luiz P.},
 year = {2020},
 volume = {11},
 pages = {1760},
 doi = {https://doi.org/10.1038/s41467-020-15432-4},
 journal = {Nature Communications},
 number = {1}
}
```
-----


## Folder content:

- `ecg_tracings.hdf5`: The HDF5 file containing a single dataset named `tracings`. This dataset is a `(827, 4096, 12)` tensor. The first dimension correspond to the 827 different exams from different patients; the second dimension correspond to the 4096 signal samples; the third dimension to the 12 different leads of the ECG exams in the following order: `{DI, DII, DIII, AVR, AVL, AVF, V1, V2, V3, V4, V5, V6}`. 

The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all have the same size (4096 samples) we fill them with zeros on both sizes. For instance, for a 7 seconds ECG signal with 2800 samples we include 648 samples at the beginning and 648 samples at the end, yielding 4096 samples that are them saved in the hdf5 dataset. All signal are represented as floating point numbers at the scale 1e-4V: so it should be multiplied by 1000 in order to obtain the signals in V.

In python, one can read this file using the following sequence:
```python
import h5py
with h5py.File(args.tracings, "r") as f:
  x = np.array(f['tracings'])
```

- The file `attributes.csv` contain basic patient attributes: sex (M or F) and age. It
contain 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` correspond to the i-th line.
- `annotations/`: folder containing annotations csv format. Each csv file contain 827 lines (plus the header). The i-th line correspond to the i-th tracing in `ecg_tracings.hdf5` correspond to the in all csv files. The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`
corresponding to weather the annotator have detect the abnormality in the ECG (`=1`) or not (`=0`).
 1. `cardiologist[1,2].csv` contain annotations from two different cardiologist.
 2. `gold_standard.csv` gold standard annotation for this test dataset. When the cardiologist 1 and cardiologist 2 agree, the common diagnosis was considered as gold standard. In cases where there was any disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis. 
 3. `dnn.csv` prediction from the deep neural network described in the paper. THe threshold is set in such way it maximizes the F1 score.
 4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
 5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
 6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).

ERA-NUTS: time-series based on C3S ERA5 for European regions
zenodo.org
nc, zip
Updated Aug 4, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
M. De Felice; M. De Felice; K. Kavvadias; K. Kavvadias (2022). ERA-NUTS: time-series based on C3S ERA5 for European regions [Dataset]. http://doi.org/10.5281/zenodo.2650191
Explore at:
zip, ncAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.2650191
Dataset updated
Aug 4, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
M. De Felice; M. De Felice; K. Kavvadias; K. Kavvadias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
# ERA-NUTS (1980-2018)

This dataset contains a set of time-series of meteorological variables based on Copernicus Climate Change Service (C3S) ERA5 reanalysis. The data files can be downloaded from here while notebooks and other files can be found on the associated Github repository.

This data has been generated with the aim of providing hourly time-series of the meteorological variables commonly used for power system modelling and, more in general, studies on energy systems.

An example of the analysis that can be performed with ERA-NUTS is shown in this video.

Important: this dataset is still a work-in-progress, we will add more analysis and variables in the near-future. If you spot an error or something strange in the data please tell us sending an email or opening an Issue in the associated Github repository.

## Data
The time-series have hourly/daily/monthly frequency and are aggregated following the NUTS 2016 classification. NUTS (Nomenclature of Territorial Units for Statistics) is a European Union standard for referencing the subdivisions of countries (member states, candidate countries and EFTA countries).

This dataset contains NUTS0/1/2 time-series for the following variables obtained from the ERA5 reanalysis data (in brackets the name of the variable on the Copernicus Data Store and its unit measure):

- t2m: 2-meter temperature (`2m_temperature`, Celsius degrees)
- ssrd: Surface solar radiation (`surface_solar_radiation_downwards`, Watt per square meter)
- ssrdc: Surface solar radiation clear-sky (`surface_solar_radiation_downward_clear_sky`, Watt per square meter)
- ro: Runoff (`runoff`, millimeters)

There are also a set of derived variables:
- ws10: Wind speed at 10 meters (derived by `10m_u_component_of_wind` and `10m_v_component_of_wind`, meters per second)
- ws100: Wind speed at 100 meters (derived by `100m_u_component_of_wind` and `100m_v_component_of_wind`, meters per second)
- CS: Clear-Sky index (the ratio between the solar radiation and the solar radiation clear-sky)
- HDD/CDD: Heating/Cooling Degree days (derived by 2-meter temperature the EUROSTAT definition.

For each variable we have 350 599 hourly samples (from 01-01-1980 00:00:00 to 31-12-2019 23:00:00) for 34/115/309 regions (NUTS 0/1/2).

The data is provided in two formats:

- NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as `int16` type using a `scale_factor` of 0.01 to minimise the size of the files.
- Comma Separated Value ("single index" format for all the variables and the time frequencies and "stacked" only for daily and monthly)

All the CSV files are stored in a zipped file for each variable.

## Methodology

The time-series have been generated using the following workflow:

1. The NetCDF files are downloaded from the Copernicus Data Store from the ERA5 hourly data on single levels from 1979 to present dataset
2. The data is read in R with the climate4r packages and aggregated using the function `/get_ts_from_shp` from panas. All the variables are aggregated at the NUTS boundaries using the average except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
3. The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R
4. The NetCDF are created using `xarray` in Python 3.7.

NOTE: air temperature, solar radiation, runoff and wind speed hourly data have been rounded with two decimal digits.

## Example notebooks

In the folder `notebooks` on the associated Github repository there are two Jupyter notebooks which shows how to deal effectively with the NetCDF data in `xarray` and how to visualise them in several ways by using matplotlib or the enlopy package.

There are currently two notebooks:

- exploring-ERA-NUTS: it shows how to open the NetCDF files (with Dask), how to manipulate and visualise them.
- ERA-NUTS-explore-with-widget: explorer interactively the datasets with [jupyter]() and ipywidgets.

The notebook `exploring-ERA-NUTS` is also available rendered as HTML.

## Additional files

In the folder `additional files`on the associated Github repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points with respect to each NUTS0/1/2 region.

## License

This dataset is released under CC-BY-4.0 license.
GENEActiv accelerometer file related to the #120 OxWearables / stepcount...
zenodo.org
data.niaid.nih.gov
Updated Nov 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Guillaume Wattelez; Guillaume Wattelez (2024). GENEActiv accelerometer file related to the #120 OxWearables / stepcount issue [Dataset]. http://doi.org/10.5281/zenodo.14213237
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.14213237
Dataset updated
Nov 25, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Guillaume Wattelez; Guillaume Wattelez
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Oct 2, 2018 - Oct 6, 2018
Description
An example of .bin file that have an IndexError when processing.

Consider https://github.com/OxWearables/stepcount/issues/120" target="_blank" rel="noopener">#120 OxWearables / stepcount issue for more details.

The .csv files are 1-second epoch conversions from the .bin file and contain time, x, y, z columns. The conversion was done by:

reading the .bin with the https://www.rdocumentation.org/packages/GENEAread/" target="_blank" rel="noopener">GENEAread R package.

keeping only the time, x, y and z columns.

saving the data.frame into a .csv file.

The only difference between the .csv files is the column format used for the time column before saving:

time column in XXXXXX_....csv had a string class

time column in XXXXXT....csv had a "POSIXct" "POSIXt" class
Z
ERA-NUTS: meteorological time-series based on C3S ERA5 for European regions...
data-staging.niaid.nih.gov
data.niaid.nih.gov
Updated Aug 4, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
M. De Felice; K. Kavvadias (2022). ERA-NUTS: meteorological time-series based on C3S ERA5 for European regions (1980-2021) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_2650190
Explore at:
Dataset updated
Aug 4, 2022
Dataset provided by
European Commission, Joint Research Centre (JRC)
Authors
M. De Felice; K. Kavvadias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ERA-NUTS (1980-2021)

This dataset contains a set of time-series of meteorological variables based on Copernicus Climate Change Service (C3S) ERA5 reanalysis. The data files can be downloaded from here while notebooks and other files can be found on the associated Github repository.

This data has been generated with the aim of providing hourly time-series of the meteorological variables commonly used for power system modelling and, more in general, studies on energy systems.

An example of the analysis that can be performed with ERA-NUTS is shown in this video.

Important: this dataset is still a work-in-progress, we will add more analysis and variables in the near-future. If you spot an error or something strange in the data please tell us sending an email or opening an Issue in the associated Github repository.

Data

The time-series have hourly/daily/monthly frequency and are aggregated following the NUTS 2016 classification. NUTS (Nomenclature of Territorial Units for Statistics) is a European Union standard for referencing the subdivisions of countries (member states, candidate countries and EFTA countries).

This dataset contains NUTS0/1/2 time-series for the following variables obtained from the ERA5 reanalysis data (in brackets the name of the variable on the Copernicus Data Store and its unit measure):

t2m: 2-meter temperature (2m_temperature, Celsius degrees)

ssrd: Surface solar radiation (surface_solar_radiation_downwards, Watt per square meter)

ssrdc: Surface solar radiation clear-sky (surface_solar_radiation_downward_clear_sky, Watt per square meter)

ro: Runoff (runoff, millimeters)

sd: Snow depth (sd, meters)

There are also a set of derived variables: - ws10: Wind speed at 10 meters (derived by 10m_u_component_of_wind and 10m_v_component_of_wind, meters per second) - ws100: Wind speed at 100 meters (derived by 100m_u_component_of_wind and 100m_v_component_of_wind, meters per second) - CS: Clear-Sky index (the ratio between the solar radiation and the solar radiation clear-sky) - RH: Relative Humidity (computed following Lawrence, BAMS 2005 and Alduchov & Eskridge, 1996) - HDD/CDD: Heating/Cooling Degree days (derived by 2-meter temperature the EUROSTAT definition.

For each variable we have 367 440 hourly samples (from 01-01-1980 00:00:00 to 31-12-2021 23:00:00) for 34/115/309 regions (NUTS 0/1/2).

The data is provided in two formats:

NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as int16 type using a scale_factor to minimise the size of the files.

Comma Separated Value ("single index" format for all the variables and the time frequencies and "stacked" only for daily and monthly)

All the CSV files are stored in a zipped file for each variable.

Methodology

The time-series have been generated using the following workflow:

The NetCDF files are downloaded from the Copernicus Data Store from the ERA5 hourly data on single levels from 1979 to present dataset

The data is read in R with the climate4r packages and aggregated using the function /get_ts_from_shp from panas. All the variables are aggregated at the NUTS boundaries using the average except for the runoff, which consists of the sum of all the grid points within the regional/national borders.

The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R

The NetCDF are created using xarray in Python 3.8.

Example notebooks

In the folder notebooks on the associated Github repository there are two Jupyter notebooks which shows how to deal effectively with the NetCDF data in xarray and how to visualise them in several ways by using matplotlib or the enlopy package.

There are currently two notebooks:

exploring-ERA-NUTS: it shows how to open the NetCDF files (with Dask), how to manipulate and visualise them.

ERA-NUTS-explore-with-widget: explorer interactively the datasets with jupyter and ipywidgets.

The notebook exploring-ERA-NUTS is also available rendered as HTML.

Additional files

In the folder additional fileson the associated Github repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points with respect to each NUTS0/1/2 region.

License

This dataset is released under CC-BY-4.0 license.

Changelog

2022-04-08 Added Relative Humidity (RH) 2022-03-07 Added the missing month in CDD/HDD 2022-02-08 Updated the wind speed and temperature data due to missing months.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

bin, application/gzip, zip, text/x-pythonAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup
the process:

####Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

####Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `

Clear search

Close search

Google apps

Main menu

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

Dataset for Assessing Multi-Dimensional Impacts of Achieving Sustainability...

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

Dollar-Rial-Toman Live Price Dataset

Dollar-Rial-Toman Live Price Dataset

Dataset Overview

Data Structure

Download the Data

View Scraper and workflow source on GitHub

Documentation & Charts

Loading in Python

Direct Load in Python

Loading in R

Data Quality & Updates

Technical Implementation

Contributing

License

Citation

Keywords

Disclaimer

The North Pacific Eukaryotic Gene Catalog: clustered nucleotide...

Data from: Generalizable EHR-R-REDCap pipeline for a national...

Data from: Reference transcriptomics of porcine peripheral immune cells...

Food and Agriculture Biomass Input–Output (FABIO) database

Data from: Species Portfolio Effects Dominate Seasonal Zooplankton...

Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

Newcastle polysomnography and accelerometer data

Newcastle PSG+Accelerometer study 2015

Polysomnography

Accelerometer data

Participant information

Example processing

CODE-test: An annotated 12-lead ECG dataset

ERA-NUTS: time-series based on C3S ERA5 for European regions

GENEActiv accelerometer file related to the #120 OxWearables / stepcount...

ERA-NUTS: meteorological time-series based on C3S ERA5 for European regions...

ERA-NUTS (1980-2021)

Data

Methodology

Example notebooks

Additional files

License

Changelog

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI EcosystemSee More Versions

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem