15 datasets found
  1. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semi-structured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2 TB of disk space
    - at least 16 GB of RAM (64 GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file (see the sketch after this list):
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt`.
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/__init__.py` and comment out everything except GitHub support
     in `PROVIDERS`.
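
    For reference, a minimal `settings.py` sketch matching the two edits above
    (the path and token values are placeholders, not values from the original template):

       # settings.py -- only the two settings discussed above are shown
       DATASET_PATH = "/path/to/new/dataset/folder"      # newly created output folder
       SCRAPER_GITHUB_API_TOKENS = [
           "<your-github-api-token>",                    # at least one token is required
       ]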
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15-30 minutes.
    
    - create a folder `
  2. Dataset for Assessing Multi-Dimensional Impacts of Achieving Sustainability...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Mar 17, 2024
    Cite
    Kyle, Page; Ollenburger, Mary; Zhang, Xin; Niazi, Hassan; Durga, Siddarth; Ou, Yang (2024). Dataset for Assessing Multi-Dimensional Impacts of Achieving Sustainability Goals by Projecting the Sustainable Agriculture Matrix into the Future [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7251098
    Explore at:
    Dataset updated
    Mar 17, 2024
    Dataset provided by
    Joint Global Change Research Institute, Pacific Northwest National Laboratory (JGCRI-PNNL), College Park, MD, USA
    Appalachian Laboratory, University of Maryland Center for Environmental Science (UMCES), Frostburg, MD, USA
    Authors
    Kyle, Page; Ollenburger, Mary; Zhang, Xin; Niazi, Hassan; Durga, Siddarth; Ou, Yang
    License

    https://opensource.org/licenses/BSD-2-Clause

    Description

    This data repository feeds into the meta-repository set up for post-processing of GCAM-SAM outputs. The meta-repository is available on GitHub: https://github.com/JGCRI/Kyle-etal_2022_EF

    Folders:

    • model/: the static version of the model used to simulate the 8 scenarios. See the GitHub GCAM-SAM repository to follow active development of this model.
    • inputs/: input datasets and scripts used to prepare files while post-processing. To be used with the GitHub post-processing meta-repository.
    • outdata/: GCAM-SAM output and post-processed output files used to plot figures.

    Key files:

    • SAM-matrix.dat: the consolidated GCAM-SAM output. Use proj_load.R in the meta-repository to read the file.
    • region_vals.csv: all 8 indicators in all 8 scenarios for the years 2020 through 2100 at a 10-year time step.

    Short introduction to the study:

    In this paper, the Sustainable Agriculture Matrix (SAM) is projected to 2100 using the Global Change Analysis Model (GCAM). We model combinatorial variations of yield intensification, dietary shift, and greenhouse gas mitigation scenarios. We find that several scenarios carry significant trade-offs across multiple environmental, economic, and social dimensions. Assessing these multi-dimensional trade-offs in a consistent framework improves the quality of information for decision-making.

    Should you have any questions, feel free to reach out to Page Kyle at pkyle@pnnl.gov.

  3. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    + more versions
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    HIV Prevention Trials Network (http://www.hptn.org/)
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single-molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early-cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on the PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from the datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.

  4. Dollar-Rial-Toman Live Price Dataset

    • kaggle.com
    zip
    Updated Nov 7, 2025
    + more versions
    Cite
    Koorosh Komeilizadeh (2025). Dollar-Rial-Toman Live Price Dataset [Dataset]. https://www.kaggle.com/datasets/kooroshkz/dollar-rial-toman-live-price-dataset
    Explore at:
    Available download formats: zip (66708 bytes)
    Dataset updated
    Nov 7, 2025
    Authors
    Koorosh Komeilizadeh
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dollar-Rial-Toman Live Price Dataset

    A comprehensive, daily-updated dataset of US Dollar to Iranian Rial exchange rates (USD/IRR) with historical data from November 2011 to present. This dataset is ideal for financial analysis, economic research, forecasting, and machine learning projects.

    Dataset Overview

    • Time Period: November 26, 2011 - Present (continuously updated)
    • Total Records: 3,648+ daily price points
    • Data Source: TGJU.org (Tehran Gold & Jewelry Union)
    • Update Frequency: Daily (automated via GitHub Actions)
    • Format: CSV with proper date formatting and integer price structure

    Data Structure

    The CSV file contains the following columns:

    | Column | Description | Format | Example |
    |---|---|---|---|
    | Open Price | Opening price of the day | Integer | 1012100 |
    | Low Price | Lowest price of the day | Integer | 1011700 |
    | High Price | Highest price of the day | Integer | 1034100 |
    | Close Price | Closing price of the day | Integer | 1029800 |
    | Change Amount | Price change amount | String | 15400 |
    | Change Percent | Price change percentage | String | 1.52% |
    | Gregorian Date | Gregorian date | YYYY/MM/DD | 2025/09/06 |
    | Persian Date | Persian/Shamsi date | YYYY/MM/DD | 1404/06/15 |

    Download the Data

    View Scraper and workflow source on GitHub

    This live dataset, the scraper source code, and the workflow are available on GitHub, where you can explore, download, and use them directly.

    Documentation & Charts

    Interactive charts and dataset overview are available at:
    kooroshkz.github.io/Dollar-Rial-Toman-Live-Price-Dataset

    Loading in Python

    import pandas as pd
    
    # Load dataset
    df = pd.read_csv('data/Dollar_Rial_Price_Dataset.csv')
    
    # Convert date column to datetime
    df['Gregorian Date'] = pd.to_datetime(df['Gregorian Date'], format='%Y/%m/%d')
    
    # Price columns are already integers
    price_columns = ['Open Price', 'Low Price', 'High Price', 'Close Price']
    print(df[price_columns].dtypes) # All should be int64
    

    Direct Load in Python

    # pip install kagglehub[hf-datasets]
    import kagglehub
    
    df = kagglehub.load_dataset(
      "kooroshkz/dollar-rial-toman-live-price-dataset",
      adapter="huggingface",
      file_path="Dollar_Rial_Price_Dataset.csv",
      pandas_kwargs={"parse_dates": ["Gregorian Date"]}
    )
    
    print(df.head())
    

    Loading in R

    # Load dataset
    data <- read.csv("data/Dollar_Rial_Price_Dataset.csv", stringsAsFactors = FALSE)
    
    # Convert date column
    data$Gregorian.Date <- as.Date(data$Gregorian.Date, format = "%Y/%m/%d")
    
    # View structure
    str(data)
    

    Data Quality & Updates

    • Validation: All price data undergoes validation checks for accuracy
    • Automated Updates: Dataset is automatically updated daily at 8:00 AM UTC
    • Data Integrity: Built-in duplicate prevention and format validation
    • Historical Consistency: Maintains consistent formatting across all time periods
    • Integer Prices: All price values stored as integers for precise calculations

    Technical Implementation

    This dataset is maintained using an automated web scraping system that:

    • Monitors TGJU.org for new exchange rate data
    • Validates and processes new records
    • Maintains data consistency and prevents duplicates
    • Automatically commits updates to the repository

    Contributing

    If you find data inconsistencies or have suggestions for improvements, please open an issue in the GitHub repository.

    License

    This project is licensed under the MIT License - see the LICENSE file for details.

    Citation

    If you use this dataset in your research or projects, please cite:

    Dollar-Rial-Toman Live Price Dataset
    Author: Koorosh Komeili Zadeh
    Source: https://github.com/kooroshkz/Dollar-Rial-Toman-Live-Price-Dataset
    Data Source: TGJU.org (Tehran Gold & Jewelry Union)
    Date Range: November 2011 - Present
    

    Keywords

    USD to Rial dataset, Dollar to Toman dataset, Iran exchange rate CSV, USD/IRR daily price, foreign exchange Iran dataset, TGJU data, time series currency dataset

    Disclaimer

    This ...

  5. The North Pacific Eukaryotic Gene Catalog: clustered nucleotide...

    • zenodo.org
    application/gzip
    Updated Jan 22, 2025
    Cite
    Ryan Groussman; Sacha Coesel; E. Virginia Armbrust (2025). The North Pacific Eukaryotic Gene Catalog: clustered nucleotide metatranscripts and read counts [Dataset]. http://doi.org/10.5281/zenodo.13826820
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ryan Groussman; Sacha Coesel; E. Virginia Armbrust
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset continues the development of the NPEGC Trinity de novo metatranscriptome assemblies, following on from the protein data repository of The North Pacific Eukaryotic Gene Catalog. The nucleotide sequences corresponding to the NPEGC cluster representatives are collected together in these repository files:

    NPac.G1PA.bf100.id99.nt.fasta.gz
    NPac.G2PA.bf100.id99.nt.fasta.gz
    NPac.G3PA.bf100.id99.nt.fasta.gz
    NPac.G3PA_diel.bf100.id99.nt.fasta.gz
    NPac.D1PA.bf100.id99.nt.fasta.gz

    A full description of this data is published in Scientific Data, available here: The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Please cite this publication if your research uses this data:

    Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J., & Armbrust, E. V. (2024). The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Scientific Data, 11(1), 1161.

    These nucleotide sequences have been sourced from the Zenodo repository for raw assemblies: The North Pacific Eukaryotic Gene Catalog: Raw assemblies from Gradients 1, 2 and 3

    Key processing steps are summarized below, with links to the detailed code in the main GitHub code repository: https://github.com/armbrustlab/NPac_euk_gene_catalog


    The code used to build the kallisto indices and map the short reads against these indices with kallisto is available in the code repository here: NPEGC.nt_kallisto_counts.sh

    There are two main steps:
    1. Generate the kallisto index on the sets of clustered nucleotide metatranscripts
    2. Map the short reads from environmental samples back to the assembly index
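
    As a rough illustration of these two steps (this is not the authors' NPEGC.nt_kallisto_counts.sh script; the index name and read file names below are placeholders, and paired-end reads are assumed):

        import subprocess

        # 1) build a kallisto index from one set of clustered nucleotide metatranscripts
        subprocess.run(
            ["kallisto", "index", "-i", "NPac.G1PA.idx",
             "NPac.G1PA.bf100.id99.nt.fasta.gz"],
            check=True,
        )

        # 2) map one environmental sample's short reads back to that index
        subprocess.run(
            ["kallisto", "quant", "-i", "NPac.G1PA.idx", "-o", "G1PA_sample01",
             "sample01_R1.fastq.gz", "sample01_R2.fastq.gz"],
            check=True,
        )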

    As generated above, kallisto produces separate results files for each of the sample files. Even after compression, the total size of the tarballed kallisto output directories is prohibitively large (>50 GB). We use the code in this template R script to join together the 'est_count' estimated count values for the tens of millions of protein sequences in each project metatranscriptome, along with length.

    The code in this template script was used for each project: aggregate_kallisto_counts.R
    The output count files for each project are Gzip-compressed and uploaded to the NPEGC nucleotide data repository here:

    G1PA.raw.est_counts.csv.gz
    G2PA.raw.est_counts.csv.gz
    G3PA.raw.est_counts.csv.gz
    G3PA_diel.raw.est_counts.csv.gz
    D1PA.raw.est_counts.csv.gz
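
    These per-project count tables can be inspected directly with pandas, which decompresses .gz files transparently (a minimal sketch; the exact column layout is not described above, so check the header first):

        import pandas as pd

        counts = pd.read_csv("G1PA.raw.est_counts.csv.gz")   # gzip is handled automatically
        print(counts.columns.tolist())   # locate the est_count and length columns
        print(counts.head())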

  6. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods

    eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional EDWs such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
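
    Purely to illustrate the four-column 'untidy' input shape described above (the pipeline itself is written in R and ships its own mock dataset; the values below are hypothetical):

        import pandas as pd

        # one row per collection; several lab panels are packed into a single cell
        dt = pd.DataFrame({
            "Patient Name (MRN)": ["DOE, JANE (0000001)"],
            "Collection Date": ["2020-01-15"],
            "Collection Time": ["08:30"],
            "Lab Results": ["Potassium 4.1 mmol/L; Sodium 139 mmol/L; Creatinine 0.9 mg/dL"],
        })
        print(dt)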

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  7. Data from: Reference transcriptomics of porcine peripheral immune cells...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://catalog.data.gov/dataset/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

    Resources in this dataset:

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zip. Description: Zipped folder containing the PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gx), gene names (features.tsv.gz), cell IDs (barcodes.tsv.gz). *The 'raw' count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csv. Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell; nFeature_RNA = the number of genes detected in a cell; Loupe = cell barcodes, corresponding to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells; prcntMito = percent mitochondrial reads in a cell; Scrublet = doublet probability score assigned to a cell; seurat_clusters = cluster ID assigned to a cell; PaperIDs = sample ID for a cell; celltypes = cell type ID assigned to a cell.

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csv. Description: .csv file containing the first 100 PCA coordinates for cells.

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csv. Description: .csv file containing t-SNE coordinates for all cells.

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csv. Description: .csv file containing UMAP coordinates for all cells.

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csv. Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in publication can be re-assigned using this .csv file.

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csv. Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in publication can be re-assigned using this .csv file.

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csv. Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the UMAP coordinates used in publication can be re-assigned using this .csv file.

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csv. Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and the t-SNE coordinates used in publication can be re-assigned using this .csv file.

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txt. Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. The 'Name' column corresponds to the name assigned to a feature in the dataset.

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tar. Description: .h5Seurat object of all cells in the PBMC dataset. The file needs to be untarred, then read into R using the function LoadH5Seurat().

  8. Food and Agriculture Biomass Input–Output (FABIO) database

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Jun 8, 2022
    + more versions
    Cite
    Bruckner, Martin; Kuschnig, Nikolas (2022). Food and Agriculture Biomass Input–Output (FABIO) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2577066
    Explore at:
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    Vienna University of Economics and Business
    Authors
    Bruckner, Martin; Kuschnig, Nikolas
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This data repository provides the Food and Agriculture Biomass Input Output (FABIO) database, a global set of multi-regional physical supply-use and input-output tables covering global agriculture and forestry.

    The work is based on mostly freely available data from FAOSTAT, IEA, EIA, and UN Comtrade/BACI. FABIO currently covers 191 countries + RoW, 118 processes and 125 commodities (raw and processed agricultural and food products) for 1986-2013. All R codes and auxiliary data are available on GitHub. For more information please refer to https://fabio.fineprint.global.

    The database consists of the following main components, in compressed .rds format:

    Z: the inter-commodity input-output matrix, displaying the relationships of intermediate use of each commodity in the production of each commodity, in physical units (tons). The matrix has 24000 rows and columns (125 commodities x 192 regions), and is available in two versions, based on the method to allocate inputs to outputs in production processes: Z_mass (mass allocation) and Z_value (value allocation). Note that the row sums of the Z matrix (= total intermediate use by commodity) are identical in both versions.

    Y: the final demand matrix, denoting the consumption of all 24000 commodities by destination country and final use category. There are six final use categories (yielding 192 x 6 = 1152 columns): 1) food use, 2) other use (non-food), 3) losses, 4) stock addition, 5) balancing, and 6) unspecified.

    X: the total output vector of all 24000 commodities. Total output is equal to the sum of intermediate and final use by commodity.

    L: the Leontief inverse, computed as (I – A)^-1, where A is the matrix of input coefficients derived from Z and x. Again, there are two versions, depending on the underlying version of Z (L_mass and L_value).

    E: environmental extensions for each of the 24000 commodities, including four resource categories: 1) primary biomass extraction (in tons), 2) land use (in hectares), 3) blue water use (in m3), and 4) green water use (in m3).

    mr_sup_mass/mr_sup_value: For each allocation method (mass/value), the supply table gives the physical supply quantity of each commodity by producing process, with processes in the rows (118 processes x 192 regions = 22656 rows) and commodities in columns (24000 columns).

    mr_use: the use table captures the quantities of each commodity (rows) used as an input in each process (columns).

    A description of the included countries and commodities (i.e. the rows and columns of the Z matrix) can be found in the auxiliary file io_codes.csv. Separate lists of the country sample (including ISO3 codes and continental grouping) and commodities (including moisture content) are given in the files regions.csv and items.csv, respectively. For information on the individual processes, see auxiliary file su_codes.csv. RDS files can be opened in R. Information on how to read these files can be obtained here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

    Except of X.rds, which contains a matrix, all variables are organized as lists, where each element contains a sparse matrix. Please note that values are always given in physical units, i.e. tonnes or head, as specified in items.csv. The suffixes value and mass only indicate the form of allocation chosen for the construction of the symmetric IO tables (for more details see Bruckner et al. 2019). Product, process and country classifications can be found in the file fabio_classifications.xlsx.

    Footprint results are not contained in the database but can be calculated, e.g. by using this script: https://github.com/martinbruckner/fabio_comparison/blob/master/R/fabio_footprints.R
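
    For orientation only, a schematic NumPy version of that footprint calculation using the objects described above; the linked fabio_footprints.R script is the authoritative implementation and works on the sparse matrices directly:

        import numpy as np

        def footprints(e, x, L, Y):
            # e: one environmental extension row (e.g. land use) per commodity,
            # x: total output vector, L: Leontief inverse, Y: final demand matrix
            intensity = np.divide(e, x, out=np.zeros_like(e, dtype=float), where=x != 0)
            # commodity-by-destination footprints: diag(intensity) * L * Y
            return np.diag(intensity) @ L @ Y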

    How to cite:

    To cite FABIO work please refer to this paper:

    Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., Börner, J. 2019. FABIO – The Construction of the Food and Agriculture Input–Output Model. Environmental Science & Technology 53(19), 11302–11312. DOI: 10.1021/acs.est.9b03554

    License:

    This data repository is distributed under the CC BY-NC-SA 4.0 License. You are free to share and adapt the material for non-commercial purposes using proper citation. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. In case you are interested in a collaboration, I am happy to receive enquiries at martin.bruckner@wu.ac.at.

    Known issues:

    The underlying FAO data have been manipulated to the minimum extent necessary. Data filling and supply-use balancing, however, required some adaptations. These are documented in the code and are also reflected in the balancing item in the final demand matrices. For a proper use of the database, I recommend distributing the balancing item over all other uses proportionally and doing analyses with and without balancing to illustrate uncertainties.

  9. Data from: Species Portfolio Effects Dominate Seasonal Zooplankton...

    • data.niaid.nih.gov
    Updated Mar 16, 2022
    Cite
    O'Connor, Reilly (2022). Species Portfolio Effects Dominate Seasonal Zooplankton Stabilization Within a Large Temperate Lake [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_6345004
    Explore at:
    Dataset updated
    Mar 16, 2022
    Dataset provided by
    University of Guelph
    Authors
    O'Connor, Reilly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The raw data file is available online for public access (https://data.ontario.ca/dataset/lake-simcoe-monitoring). Download the 1980-2019 csv files and open up the file named "Simcoe_Zooplankton&Bythotrephes.csv". Copy and paste the zooplankton sheet into a new excel file called "Simcoe_Zooplankton.csv". The column ZDATE in the excel file needs to be switched from GENERAL to SHORT DATE so that the dates in the ZDATE column read "YYYY/MM/DD". Save as .csv in the appropriate R folder. The data file "simcoe_manual_subset_weeks_5" is the raw data that has been subset for the main analysis of the article using the .R file "Simcoe MS - 5 Station Subset Data". The .csv file produced from this must then be manually edited to remove data points that do not have 5 stations per sampling period, as well as to combine data points that should fall into a single week. The "simcoe_manual_subset_weeks_5.csv" is then used for the calculation of variability, stabilization, asynchrony, and Shannon Diversity for each year in the .R file "Simcoe MS - 5 Station Calculations". The final .R file "Simcoe MS - 5 Station Analysis" contains the final statistical analyses as well as code to reproduce the original figures. Data and code for main and supplementary analyses are also available on GitHub (https://github.com/reillyoc/ZPseasonalPEs).
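
    A rough pandas sketch of the week-level subsetting described above (this is not the authors' R workflow; every column name except ZDATE is hypothetical, and the manual corrections described in the text are not reproduced):

        import pandas as pd

        zp = pd.read_csv("Simcoe_Zooplankton.csv")
        zp["ZDATE"] = pd.to_datetime(zp["ZDATE"], format="%Y/%m/%d")

        # group samples into year/week sampling periods and keep only the periods
        # where all 5 stations are present ("STATION" is a hypothetical column name)
        iso = zp["ZDATE"].dt.isocalendar()
        zp["year"], zp["week"] = iso.year, iso.week
        keep = zp.groupby(["year", "week"])["STATION"].transform("nunique") == 5
        zp[keep].to_csv("simcoe_manual_subset_weeks_5.csv", index=False)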

  10. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Dec 24, 2022
    Cite
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
    Explore at:
    Available download formats: bin, zip, csv
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments are also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    • The data is licensed through the Creative Commons Attribution 4.0 International.
    • If you have used our data and are publishing your work, we ask that you please reference both:
      1. this database through its DOI, and
      2. any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    • Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
    • Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
    • Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data
      • Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
      • We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Clean_Data_v1-0-0.zip: contains all the downsampled data
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Database_References_v1-0-0.bib
      • Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_

    • The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
    • Time[s]: time in seconds since the start of the test
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_

    • The first column is the index of each data point
    • S/No: sample number recorded by the DAQ
    • System Date: Date and time of sample
    • Time[s]: time in seconds since the start of the test
    • C_1_Force[kN]: load cell force
    • C_1_Déform1[mm]: extensometer displacement
    • C_1_Déplacement[mm]: cross-head displacement
    • Eng_Stress[MPa]: engineering stress
    • Eng_Strain[]: engineering strain
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    • hidden_index: internal reference ID
    • grade: material grade
    • spec: specifications for the material
    • source: base material for the test specimen
    • id: internal name for the specimen
    • lp: load protocol
    • size: type of specimen (M8, M12, M20)
    • gage_length_mm_: unreduced section length in mm
    • avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
    • avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
    • avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
    • fy_n_mpa_: nominal yield stress
    • fu_n_mpa_: nominal ultimate stress
    • t_a_deg_c_: ambient temperature in degC
    • date: date of test
    • investigator: person(s) who conducted the test
    • location: laboratory where test was conducted
    • machine: setup used to conduct test
    • pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
    • pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
    • pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
    • citekey: reference corresponding to the Database_References.bib file
    • yield_stress_mpa_: computed yield stress in MPa
    • elastic_modulus_mpa_: computed elastic modulus in MPa
    • fracture_strain: computed average true strain across the fracture surface
    • c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
    • file: file name of corresponding clean (downsampled) stress-strain data

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    # e.g. for Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv
    import pandas as pd
    date, version = '2022-08-25_', 'v1-0-0'
    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
              index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
              keep_default_na=False, na_values='')
    • citekey: reference in "Campaign_References.bib".
    • Grade: material grade.
    • Spec.: specifications (e.g., J2+N).
    • Yield Stress [MPa]: initial yield stress in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
    • Elastic Modulus [MPa]: initial elastic modulus in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    • The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:
      • A500
      • A992_Gr50
      • BCP325
      • BCR295
      • HYP400
      • S460NL
      • S690QL/25mm
      • S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
  11. Newcastle polysomnography and accelerometer data

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Vincent van Hees; Sarah Charman; Kirstie Anderson (2020). Newcastle polysomnography and accelerometer data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1160409
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Netherlands eScience Center, Amsterdam
    Newcastle University, Newcastle
    Regional Sleep Service, Freeman Hospital, Newcastle
    Authors
    Vincent van Hees; Sarah Charman; Kirstie Anderson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Newcastle PSG+Accelerometer study 2015

    This data set contains 55 .bin files, 28 .txt files, and one .csv file, which were collected in Newcastle upon Tyne (UK) to evaluate an accelerometer-based algorithm for sleep classification. The data come from a single-night polysomnography recording in 28 sleep clinic patients. A description of the experimental protocol can be found in this open access PLoS ONE paper from 2015: https://doi.org/10.1371/journal.pone.0142533.

    Polysomnography

    Sleep scores derived from polysomnography are stored in the .txt files. Each file represents a time series (one night) of one participant. The resolution of the scoring is 30 seconds. Participants are numbered. The participant number is included in the file names as “mecsleep01_...”. pariticpants_info.csv is a dictionary of participant number, diagnosis, age, and sex.

    Accelerometer data

    Accelerometer data from brand GENEActiv (https://www.activinsights.com) are stored in .bin files. Per participant, two accelerometers were used: one accelerometer on each wrist (left and right). The right-wrist recording from participant 10 is missing, hence the total of 55 .bin files. The tri-axial (three-axis) accelerometers were configured to record at 85.7 Hertz. The accelerometer data can be read with the R package GENEAread https://cran.r-project.org/web/packages/GENEAread/index.html. Additional information on the accelerometer can be found on the manufacturer's product website: https://www.activinsights.com/resources-support/geneactiv/downloads-software/, including a description of the binary file structure on page 27 of this (pdf) file: https://49wvycy00mv416l561vrj345-wpengine.netdna-ssl.com/wp-content/uploads/2014/03/geneactiv_instruction_manual_v1.2.pdf. The participant number and the body side on which the accelerometer is worn are included in the file names as “MECSLEEP01_left wrist...”.

    Participant information

    The .csv file as included in this dataset contains a dictionary of the participant numbers, sleep disorder diagnosis, participant age at the time of measurement, and sex.

    Example processing

    The code we used ourselves to process this data can be found in this GitHub repository: https://github.com/wadpac/psg-ncl-acc-spt-detection-eval. Note that we use R package GGIR: https://cran.r-project.org/web/packages/GGIR/, which calls R package GENEAread for reading the binary data.

  12. CODE-test: An annotated 12-lead ECG dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 7, 2021
    + more versions
    Cite
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2021). CODE-test: An annotated 12-lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3765780
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Antonio H Ribeiro; Antonio H Ribeiro; Manoel Horta Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Gabriela M. Paixão; Derick M. Oliveira; Derick M. Oliveira; Paulo R. Gomes; Paulo R. Gomes; Jéssica A. Canazart; Jéssica A. Canazart; Milton P. Ferreira; Milton P. Ferreira; Carl R. Andersson; Carl R. Andersson; Peter W. Macfarlane; Peter W. Macfarlane; Wagner Meira Jr.; Wagner Meira Jr.; Thomas B. Schön; Thomas B. Schön; Antonio Luiz P. Ribeiro; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    This dataset contains 827 ECG tracings from different patients, annotated by several cardiologists, residents, and medical students. It is used as the test set in the paper "Automatic diagnosis of the 12-lead ECG using a deep neural network": https://www.nature.com/articles/s41467-020-15432-4.
    
    It contains annotations for six different ECG abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    Companion python scripts are available in:
    https://github.com/antonior92/automatic-ecg-diagnosis
    
    --------
    
    Citation
    ```
    Ribeiro, A.H., Ribeiro, M.H., Paixão, G.M.M. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 11, 1760 (2020). https://doi.org/10.1038/s41467-020-15432-4
    ```
    
    Bibtex:
    ```
    @article{ribeiro_automatic_2020,
     title = {Automatic Diagnosis of the 12-Lead {{ECG}} Using a Deep Neural Network},
     author = {Ribeiro, Ant{\^o}nio H. and Ribeiro, Manoel Horta and Paix{\~a}o, Gabriela M. M. and Oliveira, Derick M. and Gomes, Paulo R. and Canazart, J{\'e}ssica A. and Ferreira, Milton P. S. and Andersson, Carl R. and Macfarlane, Peter W. and Meira Jr., Wagner and Sch{\"o}n, Thomas B. and Ribeiro, Antonio Luiz P.},
     year = {2020},
     volume = {11},
     pages = {1760},
     doi = {https://doi.org/10.1038/s41467-020-15432-4},
     journal = {Nature Communications},
     number = {1}
    }
    ```
    -----
    
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: The HDF5 file containing a single dataset named `tracings`. This dataset is an `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different patients; the second dimension corresponds to the 4096 signal samples; the third dimension corresponds to the 12 different leads of the ECG exams in the following order: `{DI, DII, DIII, AVR, AVL, AVF, V1, V2, V3, V4, V5, V6}`.
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all the same size (4096 samples), we pad them with zeros on both sides. For instance, a 7-second ECG signal with 2800 samples gets 648 samples added at the beginning and 648 samples at the end, yielding 4096 samples that are then saved in the HDF5 dataset. All signals are represented as floating point numbers at the scale 1e-4V, so they should be multiplied by 1000 in order to obtain the signals in V.
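
    To make the padding arithmetic concrete, here is a small illustrative NumPy sketch (not code from the companion repository) that pads a 7-second exam to 4096 samples as described above:

    ```python
    import numpy as np

    signal = np.zeros((2800, 12))            # a 7-second exam: 7 * 400 = 2800 samples
    left = (4096 - signal.shape[0]) // 2     # 648 zeros before the signal
    right = 4096 - signal.shape[0] - left    # 648 zeros after the signal
    padded = np.pad(signal, ((left, right), (0, 0)))
    assert padded.shape == (4096, 12)
    ```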
    
    In Python, one can read this file using the following sequence (using the file name documented above):
    ```python
    import h5py
    import numpy as np

    with h5py.File("ecg_tracings.hdf5", "r") as f:
        x = np.array(f["tracings"])
    ```
    
    - The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It
    contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
    - `annotations/`: folder containing annotations in CSV format. Each CSV file contains 827 lines (plus the header); the i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5`, consistently across all CSV files. The CSV files all have the same 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`,
    indicating whether the annotator detected the abnormality in the ECG (`=1`) or not (`=0`).
     1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
     2. `gold_standard.csv` contains the gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2 agree, the common diagnosis was considered the gold standard. In cases of disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis.
     3. `dnn.csv` contains predictions from the deep neural network described in the paper. The threshold is set such that it maximizes the F1 score (see the comparison sketch after this list).
     4. `cardiology_residents.csv` contains annotations from two 4th-year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` contains annotations from two 3rd-year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` contains annotations from two 5th-year medical students (each annotated half of the dataset).
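
    As a minimal, hedged example of using these files, the sketch below compares `dnn.csv` against `gold_standard.csv` per abnormality with scikit-learn; it assumes the archive has been extracted into the current directory and that `dnn.csv` stores thresholded 0/1 predictions, as the threshold description above suggests.

    ```python
    import pandas as pd
    from sklearn.metrics import f1_score

    LABELS = ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]

    # Both files are documented to have 827 rows in the same order, one column per abnormality.
    gold = pd.read_csv("annotations/gold_standard.csv")
    dnn = pd.read_csv("annotations/dnn.csv")

    for label in LABELS:
        print(f"{label}: F1 = {f1_score(gold[label], dnn[label]):.3f}")
    ```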
    
  13. ERA-NUTS: time-series based on C3S ERA5 for European regions

    • zenodo.org
    nc, zip
    Updated Aug 4, 2022
    + more versions
    Cite
    M. De Felice; M. De Felice; K. Kavvadias; K. Kavvadias (2022). ERA-NUTS: time-series based on C3S ERA5 for European regions [Dataset]. http://doi.org/10.5281/zenodo.2650191
    Explore at:
    zip, ncAvailable download formats
    Dataset updated
    Aug 4, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    M. De Felice; M. De Felice; K. Kavvadias; K. Kavvadias
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # ERA-NUTS (1980-2018)

    This dataset contains a set of time-series of meteorological variables based on Copernicus Climate Change Service (C3S) ERA5 reanalysis. The data files can be downloaded from here while notebooks and other files can be found on the associated Github repository.

    This data has been generated with the aim of providing hourly time-series of the meteorological variables commonly used for power system modelling and, more in general, studies on energy systems.

    An example of the analysis that can be performed with ERA-NUTS is shown in this video.

    Important: this dataset is still a work in progress; we will add more analyses and variables in the near future. If you spot an error or something strange in the data, please tell us by sending an email or opening an issue in the associated Github repository.

    ## Data
    The time-series have hourly/daily/monthly frequency and are aggregated following the NUTS 2016 classification. NUTS (Nomenclature of Territorial Units for Statistics) is a European Union standard for referencing the subdivisions of countries (member states, candidate countries and EFTA countries).

    This dataset contains NUTS0/1/2 time-series for the following variables obtained from the ERA5 reanalysis data (in brackets the name of the variable on the Copernicus Data Store and its unit measure):

    - t2m: 2-meter temperature (`2m_temperature`, Celsius degrees)
    - ssrd: Surface solar radiation (`surface_solar_radiation_downwards`, Watt per square meter)
    - ssrdc: Surface solar radiation clear-sky (`surface_solar_radiation_downward_clear_sky`, Watt per square meter)
    - ro: Runoff (`runoff`, millimeters)

    There are also a set of derived variables:
    - ws10: Wind speed at 10 meters (derived from `10m_u_component_of_wind` and `10m_v_component_of_wind`, meters per second)
    - ws100: Wind speed at 100 meters (derived from `100m_u_component_of_wind` and `100m_v_component_of_wind`, meters per second)
    - CS: Clear-Sky index (the ratio between the solar radiation and the clear-sky solar radiation)
    - HDD/CDD: Heating/Cooling Degree Days (derived from the 2-meter temperature following the EUROSTAT definition)

    For each variable we have 350 599 hourly samples (from 01-01-1980 00:00:00 to 31-12-2019 23:00:00) for 34/115/309 regions (NUTS 0/1/2).

    The data is provided in two formats:

    - NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as `int16` type using a `scale_factor` of 0.01 to minimise the size of the files.
    - Comma Separated Value ("single index" format for all the variables and the time frequencies and "stacked" only for daily and monthly)

    All the CSV files are stored in a zipped file for each variable.
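
    As a small sketch of how the NetCDF files can be used, the example below opens one file with `xarray`, whose default CF decoding applies the stored `scale_factor` automatically, and extracts the hourly series for one region; the file name and the name of the region coordinate are assumptions, since they are not spelled out above.

    ```python
    import xarray as xr

    # File name and the "region" coordinate name are placeholders for illustration.
    ds = xr.open_dataset("t2m_NUTS2.nc")            # default decoding applies scale_factor

    t2m_one_region = ds["t2m"].sel(region="DE21")   # hypothetical NUTS2 code
    print(t2m_one_region.to_series().head())
    ```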

    ## Methodology

    The time-series have been generated using the following workflow:

    1. The NetCDF files are downloaded from the Copernicus Data Store from the ERA5 hourly data on single levels from 1979 to present dataset
    2. The data is read in R with the climate4r packages and aggregated using the function `/get_ts_from_shp` from panas. All the variables are aggregated at the NUTS boundaries using the average except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
    3. The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R
    4. The NetCDF files are created using `xarray` in Python 3.7.

    NOTE: air temperature, solar radiation, runoff and wind speed hourly data have been rounded with two decimal digits.

    ## Example notebooks

    In the folder `notebooks` on the associated Github repository there are two Jupyter notebooks which show how to work effectively with the NetCDF data in `xarray` and how to visualise it in several ways using matplotlib or the enlopy package.

    There are currently two notebooks:

    - exploring-ERA-NUTS: it shows how to open the NetCDF files (with Dask), how to manipulate and visualise them.
    - ERA-NUTS-explore-with-widget: it shows how to explore the datasets interactively with Jupyter and ipywidgets.

    The notebook `exploring-ERA-NUTS` is also available rendered as HTML.

    ## Additional files

    In the folder `additional files` on the associated Github repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points within each NUTS0/1/2 region.

    ## License

    This dataset is released under CC-BY-4.0 license.

  14. GENEActiv accelerometer file related to the #120 OxWearables / stepcount issue

    • zenodo.org
    • data.niaid.nih.gov
    Updated Nov 25, 2024
    Cite
    Guillaume Wattelez; Guillaume Wattelez (2024). GENEActiv accelerometer file related to the #120 OxWearables / stepcount issue [Dataset]. http://doi.org/10.5281/zenodo.14213237
    Explore at:
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Guillaume Wattelez; Guillaume Wattelez
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 2, 2018 - Oct 6, 2018
    Description

    An example of a .bin file that raises an IndexError when processed.

    See the OxWearables/stepcount issue #120 (https://github.com/OxWearables/stepcount/issues/120) for more details.

    The .csv files are 1-second epoch conversions from the .bin file and contain time, x, y, z columns. The conversion was done by:

    1. reading the .bin file with the GENEAread R package (https://www.rdocumentation.org/packages/GENEAread/).
    2. keeping only the time, x, y and z columns.
    3. saving the data.frame into a .csv file.

    The only difference between the .csv files is the class of the time column before saving (see the loading sketch below):

    • the time column in XXXXXX_....csv had a string class
    • the time column in XXXXXT....csv had a "POSIXct" "POSIXt" class
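
    A minimal loading sketch for one of the 1-second epoch .csv files; the file name is a placeholder, and the time column is assumed to hold timestamps that pandas can parse.

    ```python
    import pandas as pd

    # Placeholder file name; the .csv files are documented to contain time, x, y, z columns.
    df = pd.read_csv("geneactiv_epochs.csv")
    df["time"] = pd.to_datetime(df["time"])  # assumes a pandas-parseable timestamp format
    df = df.set_index("time")

    print(df[["x", "y", "z"]].describe())
    ```
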
  15. ERA-NUTS: meteorological time-series based on C3S ERA5 for European regions (1980-2021)

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Aug 4, 2022
    Cite
    M. De Felice; K. Kavvadias (2022). ERA-NUTS: meteorological time-series based on C3S ERA5 for European regions (1980-2021) [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_2650190
    Explore at:
    Dataset updated
    Aug 4, 2022
    Dataset provided by
    European Commission, Joint Research Centre (JRC)
    Authors
    M. De Felice; K. Kavvadias
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ERA-NUTS (1980-2021)

    This dataset contains a set of time-series of meteorological variables based on Copernicus Climate Change Service (C3S) ERA5 reanalysis. The data files can be downloaded from here while notebooks and other files can be found on the associated Github repository.

    This data has been generated with the aim of providing hourly time-series of the meteorological variables commonly used for power system modelling and, more in general, studies on energy systems.

    An example of the analysis that can be performed with ERA-NUTS is shown in this video.

    Important: this dataset is still a work in progress; we will add more analyses and variables in the near future. If you spot an error or something strange in the data, please tell us by sending an email or opening an issue in the associated Github repository.

    Data

    The time-series have hourly/daily/monthly frequency and are aggregated following the NUTS 2016 classification. NUTS (Nomenclature of Territorial Units for Statistics) is a European Union standard for referencing the subdivisions of countries (member states, candidate countries and EFTA countries).

    This dataset contains NUTS0/1/2 time-series for the following variables obtained from the ERA5 reanalysis data (in brackets the name of the variable on the Copernicus Data Store and its unit measure):

    • t2m: 2-meter temperature (2m_temperature, Celsius degrees)
    • ssrd: Surface solar radiation (surface_solar_radiation_downwards, Watt per square meter)
    • ssrdc: Surface solar radiation clear-sky (surface_solar_radiation_downward_clear_sky, Watt per square meter)
    • ro: Runoff (runoff, millimeters)
    • sd: Snow depth (sd, meters)

    There are also a set of derived variables:

    • ws10: Wind speed at 10 meters (derived from 10m_u_component_of_wind and 10m_v_component_of_wind, meters per second)
    • ws100: Wind speed at 100 meters (derived from 100m_u_component_of_wind and 100m_v_component_of_wind, meters per second)
    • CS: Clear-Sky index (the ratio between the solar radiation and the clear-sky solar radiation)
    • RH: Relative Humidity (computed following Lawrence, BAMS 2005 and Alduchov & Eskridge, 1996)
    • HDD/CDD: Heating/Cooling Degree Days (derived from the 2-meter temperature following the EUROSTAT definition)

    For each variable we have 367 440 hourly samples (from 01-01-1980 00:00:00 to 31-12-2021 23:00:00) for 34/115/309 regions (NUTS 0/1/2).

    The data is provided in two formats:

    • NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as int16 type using a scale_factor to minimise the size of the files.
    • Comma Separated Value ("single index" format for all the variables and the time frequencies and "stacked" only for daily and monthly)

    All the CSV files are stored in a zipped file for each variable.
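
    As a hedged sketch of reading the CSV distribution, the example below loads one zipped "single index" file with pandas (which can read a zip archive containing a single CSV directly) and resamples one region to daily means; the file name, the datetime index layout, and the region column code are assumptions.

    ```python
    import pandas as pd

    # Placeholder file name; assumes a datetime index and one column per region.
    df = pd.read_csv("t2m.zip", index_col=0, parse_dates=True)

    print(df["DE"].resample("D").mean().head())  # "DE" is a hypothetical NUTS0 column
    ```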

    Methodology

    The time-series have been generated using the following workflow:

    1. The NetCDF files are downloaded from the Copernicus Data Store from the ERA5 hourly data on single levels from 1979 to present dataset
    2. The data is read in R with the climate4r packages and aggregated using the function /get_ts_from_shp from panas. All the variables are aggregated at the NUTS boundaries using the average except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
    3. The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R
    4. The NetCDF files are created using xarray in Python 3.8.

    Example notebooks

    In the folder notebooks on the associated Github repository there are two Jupyter notebooks which show how to work effectively with the NetCDF data in xarray and how to visualise it in several ways using matplotlib or the enlopy package.

    There are currently two notebooks:

    • exploring-ERA-NUTS: it shows how to open the NetCDF files (with Dask), how to manipulate and visualise them.
    • ERA-NUTS-explore-with-widget: it shows how to explore the datasets interactively with Jupyter and ipywidgets.

    The notebook exploring-ERA-NUTS is also available rendered as HTML.

    Additional files

    In the folder additional files on the associated Github repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points within each NUTS0/1/2 region.

    License

    This dataset is released under CC-BY-4.0 license.

    Changelog

    • 2022-04-08: Added Relative Humidity (RH)
    • 2022-03-07: Added the missing month in CDD/HDD
    • 2022-02-08: Updated the wind speed and temperature data due to missing months


Cite
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

Related Article
Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
bin, application/gzip, zip, text/x-pythonAvailable download formats
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`.
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/__init__.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get the output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up
the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15 to 30 minutes.

- create a folder `