18 datasets found
  1. Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem - the dataset [Dataset]. http://doi.org/10.5281/zenodo.1297925
    Explore at:
    text/x-python, zip, bin, application/gzipAvailable download formats
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on detalization level (see Step 2 for more details):
    - up to 2Tb of disk space (see Step 2 detalization levels)
    - at least 16Gb of RAM (64 preferable)
    - few hours to few month of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    ####Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    ####Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  2. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. https://data.niaid.nih.gov/resources?id=dryad_w3r2280w0
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
    HIV Prevention Trials Network
    HIV Vaccine Trials Networkhttp://www.hvtn.org/
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.

  3. Data from: Data and code from: Environmental influences on drying rate of...

    • s.cnmilf.com
    • agdatacommons.nal.usda.gov
    • +2more
    Updated May 31, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agricultural Research Service (2024). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Explore at:
    Dataset updated
    May 31, 2024
    Dataset provided by
    Agricultural Research Servicehttps://www.ars.usda.gov/
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript:Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials were quickly absorbed into the body of the material, such as wood and concrete. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min. Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:Drying2022.csv: drying rate data for the 2022 experimental runWeather2022.csv: weather data for the 2022 experimental runDrying2023.csv: drying rate data for the 2023 experimental runWeather2023.csv: weather data for the 2023 experimental rundisinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation codedisinfestant_drying_analysis.html: rendered output of notebookMS_figures.R: additional R code to create figures formatted for journal requirementsfit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing clusterfit2023_discretetime_weather_solar.rds: fitted brms model object for 2023data_dictionary.xlsx: descriptions of each column in the CSV data files

  4. Data from: A dataset to model Levantine landcover and land-use change...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Dec 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Michael Kempf; Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Michael Kempf; Michael Kempf
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 16, 2023
    Area covered
    Levant
    Description

    Overview

    This dataset is the repository for the following paper submitted to Data in Brief:

    Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

    The Data in Brief article contains the supplement information and is the related data paper to:

    Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

    Description/abstract

    The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

    Folder structure

    The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:

    “code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.

    “MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

    “mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

    “yield_productivity” contains .csv files of yield information for all countries listed above.

    “population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

    “GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

    “built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolder which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.

    Code structure

    1_MODIS_NDVI_hdf_file_extraction.R


    This is the first code chunk that refers to the extraction of MODIS data from .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data from after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed, 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.


    2_MERGE_MODIS_tiles.R


    In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").


    3_CROP_MODIS_merged_tiles.R


    Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
    The repository provides the already clipped and merged NDVI datasets.


    4_TREND_analysis_NDVI.R


    Now, we want to perform trend analysis from the derived data. The data we load is tricky as it contains 16-days return period across a year for the period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with high confidence level (0.05). Using the ggplot2 package and the melt function from reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
    To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.


    5_BUILT_UP_change_raster.R


    Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different raster to characterize the built-up change in continuous values between 1975 and 2022.


    6_POPULATION_numbers_plot.R


    For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.


    7_YIELD_plot.R


    In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv". Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.


    8_GLDAS_read_extract_trend


    The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection).
    Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
    From the processed data, trend analysis are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subset can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-July (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
    From the data, mean values of 48 consecutive years are calculated and trend analysis are performed as describe above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, ad different spatial extent across the globe due to the availability of the GLDAS variables.

  5. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Dec 24, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
    Explore at:
    bin, zip, csvAvailable download formats
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    • The data is licensed through the Creative Commons Attribution 4.0 International.
    • If you have used our data and are publishing your work, we ask that you please reference both:
      1. this database through its DOI, and
      2. any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    • Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
    • Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
    • Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data
      • Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
      • We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory similar to the "Clean_Data"
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Clean_Data_v1-0-0.zip: contains all the downsampled data
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Database_References_v1-0-0.bib
      • Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_

    • The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
    • Time[s]: time in seconds since the start of the test
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_

    • The first column is the index of each data point
    • S/No: sample number recorded by the DAQ
    • System Date: Date and time of sample
    • Time[s]: time in seconds since the start of the test
    • C_1_Force[kN]: load cell force
    • C_1_Déform1[mm]: extensometer displacement
    • C_1_Déplacement[mm]: cross-head displacement
    • Eng_Stress[MPa]: engineering stress
    • Eng_Strain[]: engineering strain
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    • hidden_index: internal reference ID
    • grade: material grade
    • spec: specifications for the material
    • source: base material for the test specimen
    • id: internal name for the specimen
    • lp: load protocol
    • size: type of specimen (M8, M12, M20)
    • gage_length_mm_: unreduced section length in mm
    • avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
    • avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
    • avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
    • fy_n_mpa_: nominal yield stress
    • fu_n_mpa_: nominal ultimate stress
    • t_a_deg_c_: ambient temperature in degC
    • date: date of test
    • investigator: person(s) who conducted the test
    • location: laboratory where test was conducted
    • machine: setup used to conduct test
    • pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
    • pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
    • pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
    • citekey: reference corresponding to the Database_References.bib file
    • yield_stress_mpa_: computed yield stress in MPa
    • elastic_modulus_mpa_: computed elastic modulus in MPa
    • fracture_strain: computed average true strain across the fracture surface
    • c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
    • file: file name of corresponding clean (downsampled) stress-strain data

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
              index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
              keep_default_na=False, na_values='')
    • citekey: reference in "Campaign_References.bib".
    • Grade: material grade.
    • Spec.: specifications (e.g., J2+N).
    • Yield Stress [MPa]: initial yield stress in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
    • Elastic Modulus [MPa]: initial elastic modulus in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    • The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:
      • A500
      • A992_Gr50
      • BCP325
      • BCR295
      • HYP400
      • S460NL
      • S690QL/25mm
      • S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
  6. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +2more
    zip
    Updated Jan 9, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown ((https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  7. c

    Data from: Data and code from: Buffered peptone water formulation does not...

    • s.cnmilf.com
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Aug 2, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agricultural Research Service (2024). Data and code from: Buffered peptone water formulation does not influence growth of pESI positive Salmonella serovar Infantis [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/data-and-code-from-buffered-peptone-water-formulation-does-not-influence-growth-of-pesi-po-cb914
    Explore at:
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Agricultural Research Service
    Description

    This repository contains all data and code required to reproduce the growth curve-fitting analysis from the manuscript: McMillan, E. A., Berrang, M. E., Read, Q. D., Rasamsetti, S., Richards, A. K., Shariat, N. W., & Frye, J. G. (2022). Buffered peptone water formulation does not influence growth of pESI-positive Salmonella enterica serovar Infantis. Journal of Food Protection, 100033. https://doi.org/10.1016/j.jfp.2022.100033 Manuscript abstract Salmonella enterica is a major cause of human foodborne illness and is often attributed to poultry food sources. S. enterica serovar Infantis, specifically those carrying the pESI plasmid, has become a frequently isolated serotype from poultry meat samples at processing and has caused numerous recent human infections. In 2016, the USDA Food Safety and Inspection Service changed the official sampling method for raw poultry products from BPW to using neutralizing BPW (nBPW) as the rinsing agent in order to prevent residual antimicrobial effects from acidifying and oxidizing processing aids. This change was contemporaneous to the emergence of pESI-positive ser. Infantis as a prevalent serovar in poultry, prompting some to question if nBPW could be selecting for this prevalent serovar. We performed two experiments: a comparison of ser. Infantis growth in BPW versus nBPW, and a simulation of regulatory sampling methods. We found that when inoculated into both broths, ser. Infantis initially grows slightly slower in nBPW than in BPW but little difference was seen in abundance after six hours of growth. Additionally, use of nBPW to simulate poultry rinse sample and overnight cold shipping to a regulatory lab did not affect survival or subsequent growth of ser. Infantis in BPW. We concluded that the change in USDA-FSIS methodology to include nBPW in sampling procedures has likely not affected the emergence of S. ser. Infantis as a prevalent serovar in chicken and turkey meat product samples. Contents All necessary data are in a single comma-separated file, Sal_Infantis_growth_curve_data_EAM.csv. All R code is in a single RMarkdown document, salmonella_growth_curve_fitting.Rmd. The RMarkdown contains code to read and process the data, produce exploratory plots, fit the model, do all hoc calculations with the posterior output, and produce figures and tables from the manuscript. Salmonella Infantis growth data: This is a comma-separated file containing data needed to reproduce the growth curve fitting analysis. Columns are: Strain: numerical ID of strain (see Table 1 in manuscript) Colony_Forming_Units_permL(A, B, C): columns 2-4 are three replicate measurements of colony forming units per mL taken from the same sample at the same time. Media: whether nBPW or BPW was used in the growth medium Time_hours: time in hours ranging from 0-6. RMarkdown document with all analysis code: This RMarkdown document contains code to read and process the data, produce exploratory plots, fit the model, do all hoc calculations with the posterior output, and produce figures and tables from the manuscript. Software versions This was run on Windows 10, R version 4.1.2. Models were fit using CmdStan version 2.28.2, with brms version 2.17.0, cmdstanr version 0.4.0, emmeans version 1.7.3, and tidybayes version 3.0.2. Program information National Program: Food Safety (108) Project Plan Number: 6040-32000-085-000-D Resources in this dataset:Resource Title: Salmonella Infantis growth data. File Name: Sal_Infantis_growth_curve_data_EAM.csvResource Title: RMarkdown document with all analysis code. File Name: salmonella_growth_curve_fitting.Rmd

  8. a

    TMS daily traffic counts CSV

    • hub.arcgis.com
    • opendata-nzta.opendata.arcgis.com
    Updated Aug 30, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Waka Kotahi (2020). TMS daily traffic counts CSV [Dataset]. https://hub.arcgis.com/datasets/9cb86b342f2d4f228067a7437a7f7313
    Explore at:
    Dataset updated
    Aug 30, 2020
    Dataset authored and provided by
    Waka Kotahi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    You can also access an API version of this dataset.

    TMS

    (traffic monitoring system) daily-updated traffic counts API

    Important note: due to the size of this dataset, you won't be able to open it fully in Excel. Use notepad / R / any software package which can open more than a million rows.

    Data reuse caveats: as per license.

    Data quality

    statement: please read the accompanying user manual, explaining:

    how

     this data is collected identification 
    
     of count stations traffic 
    
     monitoring technology monitoring 
    
     hierarchy and conventions typical 
    
     survey specification data 
    
     calculation TMS 
    
     operation. 
    

    Traffic

    monitoring for state highways: user manual

    [PDF 465 KB]

    The data is at daily granularity. However, the actual update

    frequency of the data depends on the contract the site falls within. For telemetry

    sites it's once a week on a Wednesday. Some regional sites are fortnightly, and

    some monthly or quarterly. Some are only 4 weeks a year, with timing depending

    on contractors’ programme of work.

    Data quality caveats: you must use this data in

    conjunction with the user manual and the following caveats.

    The

     road sensors used in data collection are subject to both technical errors and 
    
     environmental interference.Data 
    
     is compiled from a variety of sources. Accuracy may vary and the data 
    
     should only be used as a guide.As 
    
     not all road sections are monitored, a direct calculation of Vehicle 
    
     Kilometres Travelled (VKT) for a region is not possible.Data 
    
     is sourced from Waka Kotahi New Zealand Transport Agency TMS data.For 
    
     sites that use dual loops classification is by length. Vehicles with a length of less than 5.5m are 
    
     classed as light vehicles. Vehicles over 11m long are classed as heavy 
    
     vehicles. Vehicles between 5.5 and 11m are split 50:50 into light and 
    
     heavy.In September 2022, the National Telemetry contract was handed to a new contractor. During the handover process, due to some missing documents and aged technology, 40 of the 96 national telemetry traffic count sites went offline. Current contractor has continued to upload data from all active sites and have gradually worked to bring most offline sites back online. Please note and account for possible gaps in data from National Telemetry Sites. 
    

    The NZTA Vehicle

    Classification Relationships diagram below shows the length classification (typically dual loops) and axle classification (typically pneumatic tube counts),

    and how these map to the Monetised benefits and costs manual, table A37,

    page 254.

    Monetised benefits and costs manual [PDF 9 MB]

    For the full TMS

    classification schema see Appendix A of the traffic counting manual vehicle

    classification scheme (NZTA 2011), below.

    Traffic monitoring for state highways: user manual [PDF 465 KB]

    State highway traffic monitoring (map)

    State highway traffic monitoring sites

  9. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Oct 29, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Andrew Felton; Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 10/29/2024

    This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has updated.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    #Code information

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the
    `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and
    supporting analyses. This script generates the key figures and summary statistics
    used in the study that then get saved in the manuscript_figures folder. Note that all
    maps were produced using Python code found in the "supporting_code"" folder.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  10. Z

    [Dataset] Does Volunteer Engagement Pay Off? An Analysis of User...

    • data.niaid.nih.gov
    Updated Nov 28, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    H. Ulrich Hoppe (2022). [Dataset] Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7357746
    Explore at:
    Dataset updated
    Nov 28, 2022
    Dataset provided by
    Nicolás Felipe Gutiérrez-Páez
    Simon Krukowski
    H. Ulrich Hoppe
    Ishari Amarasinghe
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explanation/Overview:

    Corresponding dataset for the analyses and results achieved in the CS Track project in the research line on participation analyses, which is also reported in the publication "Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects", a conference paper for the conference CollabTech 2022: Collaboration Technologies and Social Computing and published as part of the Lecture Notes in Computer Science book series (LNCS,volume 13632) here. The usernames have been anonymised.

    Purpose:

    The purpose of this dataset is to provide the basis to reproduce the results reported in the associated deliverable, and in the above-mentioned publication. As such, it does not represent raw data, but rather files that already include certain analysis steps (like calculated degrees or other SNA-related measures), ready for analysis, visualisation and interpretation with R.

    Relatedness:

    The data of the different projects was derived from the forums of 7 Zooniverse projects based on similar discussion board features. The projects are: 'Galaxy Zoo', 'Gravity Spy', 'Seabirdwatch', 'Snapshot Wisconsin', 'Wildwatch Kenya', 'Galaxy Nurseries', 'Penguin Watch'.

    Content:

    In this Zenodo entry, several files can be found. The structure is as follows (files and folders and descriptions).

    corresponding_calculations.html

    Quarto-notebook to view in browser

    corresponding_calculations.qmd

    Quarto-notebook to view in RStudio

    assets

    data

    annotations

    annotations.csv

    List of annotations made per day for each of the analysed projects

    comments

    comments.csv

    Total list of comments with several data fields (i.e., comment id, text, reply_user_id)

    rolechanges

    478_rolechanges.csv

    List of roles per user to determine number of role changes

    1104_rolechanges.csv

    ...

    ...

    totalnetworkdata

    Edges

    478_edges.csv

    Network data (edge set) for the given projects (without time slices)

    1104_edges.csv

    ...

    ...

    Nodes

    478_nodes.csv

    Network data (node set) for the given projects (without time slices)

    1104_nodes.csv

    ...

    ...

    trajectories

    Network data (edge and node sets) for the given projects and all time slices (Q1 2016 - Q4 2021)

    478

    Edges

    edges_4782016_q1.csv

    edges_4782016_q2.csv

    edges_4782016_q3.csv

    edges_4782016_q4.csv

    ...

    Nodes

    nodes_4782016_q1.csv

    nodes_4782016_q4.csv

    nodes_4782016_q3.csv

    nodes_4782016_q2.csv

    ...

    1104

    Edges

    ...

    Nodes

    ...

    ...

    scripts

    datavizfuncs.R

    script for the data visualisation functions, automatically executed from within corresponding_calculations.qmd

    import.R

    script for the import of data, automatically executed from within corresponding_calculations.qmd

    corresponding_calculations_files

    files for the html/qmd view in the browser/RStudio

    Grouping:

    The data is grouped according to given criteria (e.g., project_title or time). Accordingly, the respective files can be found in the data structure

  11. d

    R-LOADEST files to produce results in the Heart River Basin, North Dakota,...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). R-LOADEST files to produce results in the Heart River Basin, North Dakota, 1970-2020 [Dataset]. https://catalog.data.gov/dataset/r-loadest-files-to-produce-results-in-the-heart-river-basin-north-dakota-1970-2020
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    Heart River, North Dakota
    Description

    This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross reference the sites with the main report (Tatge and others, 2021). The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to look at the data. The "scripts" folder contains the written R scripts to produce the results of the load estimation from the main report. R-LOADEST is a software package for analyzing loads in streams and an accompanying report (Runkel and others, 2004) serves as the formal documentation for R-LOADEST. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics. The following system requirements are necessary for producing results: Windows 10 operating system R (version 3.4 or later; 64-bit recommended) RStudio (version 1.1.456 or later) R-LOADEST program (available at https://github.com/USGS-R/rloadest). Runkel, R.L., Crawford, C.G., and Cohn, T.A., 2004, Load Estimator (LOADEST): A FORTRAN Program for Estimating Constituent Loads in Streams and Rivers: U.S. Geological Survey Techniques and Methods Book 4, Chapter A5, 69 p., [Also available at https://pubs.usgs.gov/tm/2005/tm4A5/pdf/508final.pdf.] R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.

  12. Data from: Time Series from Smart Meters

    • zenodo.org
    • research.science.eus
    • +1more
    Updated Feb 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cruz Enrique Borges Hernandez; Cruz Enrique Borges Hernandez; Carlos Quesada-Granja; Carlos Quesada-Granja (2025). Time Series from Smart Meters [Dataset]. http://doi.org/10.5281/zenodo.4455198
    Explore at:
    Dataset updated
    Feb 22, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Cruz Enrique Borges Hernandez; Cruz Enrique Borges Hernandez; Carlos Quesada-Granja; Carlos Quesada-Granja
    Description

    This file contains technical problems that make it insuitable for public use. Please use this dataset instead.

    • Name: Time Series from Smart Meters
    • Summary: The dataset contains: (1) raw and cleaned time series of smart meters from Spanish electric cooperatives, and (2) feature values and metadata extracted from residential load profiles, both from Spanish electric cooperatives and publicly available datasets.
    • License: CC BY-NC-SA
    • Acknowledge: These data have been collected in the framework of the WHY project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 891943.
    • Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME or the EC are not responsible for any use that may be made of the information contained therein.
    • Collection Date: The time series on Spanish electric cooperatives were collected between 2014 and 2021. Data on feature values and metadata were extracted between 2020 and 2021.
    • Publication Date:
    • DOI: 10.5281/zenodo.4455198
    • Other repositories:
    • Author: University of Deusto
    • Objective of collection: This data is collected originally for invoice of the electrical consumption. In this project will be used to segment the households.
    • Description: The dataset contains a CSV file with one entry for both each load profile of the Spanish electric cooperatives and publicly available load profiles. The fields that can be found for each entry are (1) metadata such as the originating dataset to which they belong, the start and end dates, number of days that have been imputed and/or extended, provenance (country, administrative division, municipality, zip code, origin identifier (households, businesses, industry, etc. ), classification of socioeconomic conditions, and tariffs; and (2) extracted features, grouped by types such as statistical moments, quantiles, lag d-day autocorrelations, seasonal aggregates, peak and off-peak periods, load factors, energy consumed, features obtained using the R package "tsfeatures", and Catch-22 features. In addition, the dataset also contains a rData file for each entry of the Spanish electric cooperatives. These files contain the original time series (timestamped values), a field indicating whether the signal has been imputed and/or extended and another field indicating the date of the extension.
    • 5 star: ⭐⭐⭐
    • Preprocessing steps: anonymization, data fusion, imputation of gaps, extension of time series shorter than 800 days, computation of features.
    • Reuse: NA
    • Update policy: The data will be updated throughout 2021.
    • Ethics and legal aspects: Spanish electric cooperative data contains the CUPS (Meter Point Administration Number), which is personal data. A pre-processing step has been carried out to substitute the CUPS by a MD5 hash.
    • Technical aspects: Decompressed data is quite large (big data).
    • Other:
  13. CODE-test: An annotated 12-lead ECG dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 7, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Antonio H Ribeiro; Antonio H Ribeiro; Manoel Horta Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Gabriela M. Paixão; Derick M. Oliveira; Derick M. Oliveira; Paulo R. Gomes; Paulo R. Gomes; Jéssica A. Canazart; Jéssica A. Canazart; Milton P. Ferreira; Milton P. Ferreira; Carl R. Andersson; Carl R. Andersson; Peter W. Macfarlane; Peter W. Macfarlane; Wagner Meira Jr.; Wagner Meira Jr.; Thomas B. Schön; Thomas B. Schön; Antonio Luiz P. Ribeiro; Antonio Luiz P. Ribeiro (2021). CODE-test: An annotated 12-lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3765780
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Antonio H Ribeiro; Antonio H Ribeiro; Manoel Horta Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Gabriela M. Paixão; Derick M. Oliveira; Derick M. Oliveira; Paulo R. Gomes; Paulo R. Gomes; Jéssica A. Canazart; Jéssica A. Canazart; Milton P. Ferreira; Milton P. Ferreira; Carl R. Andersson; Carl R. Andersson; Peter W. Macfarlane; Peter W. Macfarlane; Wagner Meira Jr.; Wagner Meira Jr.; Thomas B. Schön; Thomas B. Schön; Antonio Luiz P. Ribeiro; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contain 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students. It is used as test set on the paper: "Automatic diagnosis of the 12-lead ECG using a deep neural network". https://www.nature.com/articles/s41467-020-15432-4.
    
    It contain annotations about 6 different ECGs abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    Companion python scripts are available in:
    https://github.com/antonior92/automatic-ecg-diagnosis
    
    --------
    
    Citation
    ```
    Ribeiro, A.H., Ribeiro, M.H., Paixão, G.M.M. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 11, 1760 (2020). https://doi.org/10.1038/s41467-020-15432-4
    ```
    
    Bibtex:
    ```
    @article{ribeiro_automatic_2020,
     title = {Automatic Diagnosis of the 12-Lead {{ECG}} Using a Deep Neural Network},
     author = {Ribeiro, Ant{\^o}nio H. and Ribeiro, Manoel Horta and Paix{\~a}o, Gabriela M. M. and Oliveira, Derick M. and Gomes, Paulo R. and Canazart, J{\'e}ssica A. and Ferreira, Milton P. S. and Andersson, Carl R. and Macfarlane, Peter W. and Meira Jr., Wagner and Sch{\"o}n, Thomas B. and Ribeiro, Antonio Luiz P.},
     year = {2020},
     volume = {11},
     pages = {1760},
     doi = {https://doi.org/10.1038/s41467-020-15432-4},
     journal = {Nature Communications},
     number = {1}
    }
    ```
    -----
    
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: The HDF5 file containing a single dataset named `tracings`. This dataset is a `(827, 4096, 12)` tensor. The first dimension correspond to the 827 different exams from different patients; the second dimension correspond to the 4096 signal samples; the third dimension to the 12 different leads of the ECG exams in the following order: `{DI, DII, DIII, AVR, AVL, AVF, V1, V2, V3, V4, V5, V6}`. 
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all have the same size (4096 samples) we fill them with zeros on both sizes. For instance, for a 7 seconds ECG signal with 2800 samples we include 648 samples at the beginning and 648 samples at the end, yielding 4096 samples that are them saved in the hdf5 dataset. All signal are represented as floating point numbers at the scale 1e-4V: so it should be multiplied by 1000 in order to obtain the signals in V.
    
    In python, one can read this file using the following sequence:
    ```python
    import h5py
    with h5py.File(args.tracings, "r") as f:
      x = np.array(f['tracings'])
    ```
    
    - The file `attributes.csv` contain basic patient attributes: sex (M or F) and age. It
    contain 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` correspond to the i-th line.
    - `annotations/`: folder containing annotations csv format. Each csv file contain 827 lines (plus the header). The i-th line correspond to the i-th tracing in `ecg_tracings.hdf5` correspond to the in all csv files. The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`
    corresponding to weather the annotator have detect the abnormality in the ECG (`=1`) or not (`=0`).
     1. `cardiologist[1,2].csv` contain annotations from two different cardiologist.
     2. `gold_standard.csv` gold standard annotation for this test dataset. When the cardiologist 1 and cardiologist 2 agree, the common diagnosis was considered as gold standard. In cases where there was any disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis. 
     3. `dnn.csv` prediction from the deep neural network described in the paper. THe threshold is set in such way it maximizes the F1 score.
     4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
    
  14. u

    Data from: Data and code from: Stem borer herbivory dependent on...

    • agdatacommons.nal.usda.gov
    • datasets.ai
    • +1more
    txt
    Updated Jul 9, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hannah J. Penn; Quentin Read (2024). Data and code from: Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage [Dataset]. http://doi.org/10.15482/USDA.ADC/1529826
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Ag Data Commons
    Authors
    Hannah J. Penn; Quentin Read
    License

    U.S. Government Workshttps://www.usa.gov/government-works
    License information was derived automatically

    Description

    This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843 Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html file of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process. Notebook files

    01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output including marginal means for different varieties and contrasts between crop years. 02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output including marginal means by categorical trait and marginal trends by continuous trait.

    HTML files These HTML files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15.

    01_boring_analysis.html 02_trait_covariate_analysis.html

    CSV data files These files contain the raw data. To recreate the notebook output the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below.

    BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage

    Columns A-C are the year, date, and location. All location values are the same. Column D identifies which experiment the data point was collected from. Column E, Stubble, indicates the crop year (plant cane or first stubble) Column F indicates the variety Column G indicates the plot (integer ID) Column H indicates the stalk within each plot (integer ID) Column I, # Internodes, indicates how many internodes were on the stalk Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk) Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment Column AO contains notes

    variety_lookup.csv: summary information for the 16 varieties analyzed in this study

    Column A is the variety name Column B is the total number of stalks assessed for SCB damage for that variety across all years Column C is the number of years that variety is present in the data Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years) Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant Column F is the literature reference for the SCB resistance value

    Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study

    Column A is the variety name Column B is the SCB resistance designation as an integer Column C is the categorical SCB resistance designation (see above) Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent). Columns J-O are the same continuous traits from year 2 (first stubble) Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair.

    ZIP file of intermediate R objects To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run.

    intermediate_R_objects.zip: This file contains intermediate R objects that are generated during the model fitting and variable selection process. You may use the R objects in the .zip file if you would like to reproduce final output including figures and tables without having to refit the computationally intensive statistical models.

    binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content Resources in this dataset:Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csvResource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csvResource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csvResource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.RmdResource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.htmlResource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.RmdResource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.htmlResource Title: Intermediate R objects. File Name: intermediate_R_objects.zip

  15. c

    Datasets for assessment of nutrient load estimation approaches for small...

    • s.cnmilf.com
    • data.usgs.gov
    • +3more
    Updated Aug 8, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). Datasets for assessment of nutrient load estimation approaches for small urban streams in Durham, North Carolina, 2009-2020 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/datasets-for-assessment-of-nutrient-load-estimation-approaches-for-small-urban-stream-2009
    Explore at:
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    North Carolina, Durham
    Description

    In cooperation with the City of Durham Public Works Department Stormwater Division, the U.S. Geological Survey (USGS) conducted a study to evaluate whether alternate monitoring strategies that incorporated samples collected across an increased range of streamflows would improve nutrient load estimates for Ellerbe and Sandy Creeks, two small, highly urbanized streams in the City of Durham, North Carolina. This data release provides the associated datasets described in the Scientific Investigations Report, "Assessment of Nutrient Load Estimation Approaches for Small Urban Streams in Durham, North Carolina". Water-quality and streamflow data collected between January 2009 and December 2020 were used to develop instream nutrient-load models using the U.S. Geological Survey R-LOADEST program (Runkel and others, 2004; Runkel, 2013; Lorenz and others, 2017). The datasets contain water-quality data, streamflow data, input files for model calibration and prediction, and output files for model diagnostic and load estimation results. This data release contains 13 files: 1. "Ellerbe_water_quality_data.csv": This file contains water-quality data for samples collected by the City of Durham and USGS at Ellerbe Creek. 2. "Sandy_water_quality_data.csv": This file contains water-quality data for samples collected by the City of Durham and USGS at Sandy Creek. 3. "Ellerbe_streamflow_data.csv": This file contains USGS daily mean streamflow data for Ellerbe Creek. 4. "Sandy_streamflow_data.csv": This file contains USGS daily mean streamflow data for Sandy Creek. 5. "Ellerbe_input_calibration.zip": This zipped file contains eight individual model calibration input .csv files for nutrient load testing scenarios for Ellerbe Creek. 6. "Sandy_input_calibration.zip": This zipped file contains ten individual model calibration input .csv files for nutrient load testing scenarios for Sandy Creek. 7. "Ellerbe_input_prediction.zip": This zipped file contains four individual model prediction input .csv files for nutrient load testing scenarios for Ellerbe Creek. 8. "Sandy_input_prediction.zip": This zipped file contains five individual model prediction input .csv files for nutrient load testing scenarios for Sandy Creek. 9. "Ellerbe_model_diagnostic_results.csv": This file contains the diagnostic results for all nutrient regression models examined for Ellerbe Creek. 10. "Sandy_model_diagnostic_results.csv": This file contains the diagnostic results for all nutrient regression models examined for Sandy Creek. 11. "Ellerbe_load_estimation_results.csv": This file contains the estimated load results for all nutrient regression models examined for Ellerbe Creek. 12. "Sandy_load_estimation_results.csv": This file contains the estimated load results for all nutrient regression models examined for Sandy Creek. 13. "Readme.pdf": This document describes the datasets compiled for assessing nutrient load estimation approaches for Ellerbe and Sandy Creeks in the City of Durham, North Carolina.

  16. Data from: Data and code from: Cover crop and crop rotation effects on...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Aug 2, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agricultural Research Service (2024). Data and code from: Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield in no-till system - V2 [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-cover-crop-and-crop-rotation-effects-on-tissue-and-soil-population-dyna-831b9
    Explore at:
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Agricultural Research Servicehttps://www.ars.usda.gov/
    Description

    [Note 2023-08-14 - Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086 ] This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re The .zip archive cropping-systems-1.0.zip contains data and code files. Data stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with . yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level. Code cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code equations.Rmd: RMarkdown notebook with formatted equations formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript The Rproject file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.

  17. Annotated 12 lead ECG dataset

    • zenodo.org
    zip
    Updated Jun 7, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Antonio H Ribeiro; Antonio H Ribeiro; Manoel Horta Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Gabriela M. Paixão; Derick M. Oliveira; Derick M. Oliveira; Paulo R. Gomes; Paulo R. Gomes; Jéssica A. Canazart; Jéssica A. Canazart; Milton P. Ferreira; Milton P. Ferreira; Carl R. Andersson; Carl R. Andersson; Peter W. Macfarlane; Peter W. Macfarlane; Wagner Meira Jr.; Wagner Meira Jr.; Thomas B. Schön; Thomas B. Schön; Antonio Luiz P. Ribeiro; Antonio Luiz P. Ribeiro (2021). Annotated 12 lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3625007
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Antonio H Ribeiro; Antonio H Ribeiro; Manoel Horta Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Gabriela M. Paixão; Derick M. Oliveira; Derick M. Oliveira; Paulo R. Gomes; Paulo R. Gomes; Jéssica A. Canazart; Jéssica A. Canazart; Milton P. Ferreira; Milton P. Ferreira; Carl R. Andersson; Carl R. Andersson; Peter W. Macfarlane; Peter W. Macfarlane; Wagner Meira Jr.; Wagner Meira Jr.; Thomas B. Schön; Thomas B. Schön; Antonio Luiz P. Ribeiro; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contain 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students.
    It is used as test set on the paper:
    "Automatic Diagnosis of the Short-Duration12-Lead ECG using a Deep Neural Network".
    
    It contain annotations about 6 different ECGs abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: HDF5 file containing a single dataset named `tracings`. This dataset is a 
    `(827, 4096, 12)` tensor. The first dimension correspond to the 827 different exams from different 
    patients; the second dimension correspond to the 4096 signal samples; the third dimension to the 12
    different leads of the ECG exam. 
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of 
    10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples).
    In order to make them all have the same size (4096 samples) we fill them with zeros
    on both sizes. For instance, for a 7 seconds ECG signal with 2800 samples we include 648
    samples at the beginning and 648 samples at the end, yielding 4096 samples that are them saved
    in the hdf5 dataset. All signal are represented as floating point numbers at the scale 1e-4V: so it should
    be multiplied by 1000 in order to obtain the signals in V.
    
    In python, one can read this file using the following sequence:
    ```python
    import h5py
    with h5py.File(args.tracings, "r") as f:
      x = np.array(f['tracings'])
    ```
    
    - The file `attributes.csv` contain basic patient attributes: sex (M or F) and age. It
    contain 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` correspond to the i-th line.
    - `annotations/`: folder containing annotations csv format. Each csv file contain 827 lines (plus the header).
    The i-th line correspond to the i-th tracing in `ecg_tracings.hdf5` correspond to the in all csv files.
    The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`
    corresponding to weather the annotator have detect the abnormality in the ECG (`=1`) or not (`=0`).
     1. `cardiologist[1,2].csv` contain annotations from two different cardiologist.
     2. `gold_standard.csv` gold standard annotation for this test dataset. When the cardiologist 1 and cardiologist 2
     agree, the common diagnosis was considered as gold standard. In cases where there was any disagreement, a 
     third senior specialist, aware of the annotations from the other two, decided the diagnosis. 
     3. `dnn.csv` prediction from the deep neural network described in 
     "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network". The threshold is set in such way 
     it maximizes the F1 score.
     4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
    
  18. RailEnV-PASMVS: a dataset for multi-view stereopsis training and...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png, txt +1
    Updated Jul 18, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    André Broekman; André Broekman; Petrus Johannes Gräbe; Petrus Johannes Gräbe (2024). RailEnV-PASMVS: a dataset for multi-view stereopsis training and reconstruction applications [Dataset]. http://doi.org/10.5281/zenodo.5233840
    Explore at:
    bin, csv, txt, zip, pngAvailable download formats
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    André Broekman; André Broekman; Petrus Johannes Gräbe; Petrus Johannes Gräbe
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Perfectly Accurate, Synthetic dataset featuring a virtual railway EnVironment for Multi-View Stereopsis (RailEnV-PASMVS) is presented, consisting of 40 scenes and 79,800 renderings together with ground truth depth maps, extrinsic and intrinsic camera parameters and binary segmentation masks of all the track components and surrounding environment. Every scene is rendered from a set of 3 cameras, each positioned relative to the track for optimal 3D reconstruction of the rail profile. The set of cameras is translated across the 100-meter length of tangent (straight) track to yield a total of 1,995 camera views. Photorealistic lighting of each of the 40 scenes is achieved with the implementation of high-definition, high dynamic range (HDR) environmental textures. Additional variation is introduced in the form of camera focal lengths, random noise for the camera location and rotation parameters and shader modifications of the rail profile. Representative track geometry data is used to generate random and unique vertical alignment data for the rail profile for every scene. This primary, synthetic dataset is augmented by a smaller image collection consisting of 320 manually annotated photographs for improved segmentation performance. The specular rail profile represents the most challenging component for MVS reconstruction algorithms, pipelines and neural network architectures, increasing the ambiguity and complexity of the data distribution. RailEnV-PASMVS represents an application specific dataset for railway engineering, against the backdrop of existing datasets available in the field of computer vision, providing the precision required for novel research applications in the field of transportation engineering.

    File descriptions

    • RailEnV-PASMVS.blend (227 Mb) - Blender file (developed using Blender version 2.8.1) used to generate the dataset. The Blender file packs only one of the HDR environmental textures to use as an example, along with all the other asset textures.
    • RailEnV-PASMVS_sample.png (28 Mb) - A visual collage of 30 scenes, illustrating the variability introduced by using different models, illumination, material properties and camera focal lengths.
    • geometry.zip (2 Mb) - Geometry CSV files used for scenes 01 to 20. The Bezier curve defines the geometry of the rail profile (10 mm intervals).
    • PhysicalDataset.7z (2.0 Gb) - A smaller, secondary dataset of 320 manually annotated photographs of railway environments; only the railway profiles are annotated.
    • 01.7z-40.7z (2.0 Gb each) - Archive of every scene (01 through 40).
    • all_list.txt, training_list.txt, validation_list.txt - Text files containing the all the scene names, together with those used for validation (validation_list.txt) and training (training_list.txt), used by MVSNet.
    • index.csv - CSV file provides a convenient reference for all the sample files, linking the corresponding file and relative data path.

    Steps to reproduce

    The open source Blender software suite (https://www.blender.org/) was used to generate the dataset, with the entire pipeline developed using the exposed Python API interface. The camera trajectory is kept fixed for all 40 scenes, except for small perturbations introduced in the form of random noise to increase the camera variation. The camera intrinsic information was initially exported as a single CSV file (scene.csv) for every scene, from which the camera information files were generated; this includes the focal length (focalLengthmm), image sensor dimensions (pixelDimensionX, pixelDimensionY), position, coordinate vector (vectC) and rotation vector (vectR). The STL model files, as provided in this data repository, were exported directly from Blender, such that the geometry/scenes can be reproduced. The data processing below is written for a Python implementation, transforming the information from Blender's coordinate system into universal rotation (R_world2cv) and translation (T_world2cv) matrices.

    import numpy as np
    from scipy.spatial.transform import Rotation as R
    
    #The intrinsic matrix K is constructed using the following formulation:
    focalLengthPixel = focalLengthmm x pixelDimensionX / sensorWidthmm
    K = [[focalLengthPixel, 0, dimX/2],
       [0, focalPixel, dimY/2],
       [0, 0, 1]]
    
    #The rotation vector as provided by Blender was first transformed to a rotation matrix:
    r = R.from_euler('xyz', vectR, degrees=True)
    matR = r.as_matrix()
    
    #Transpose the rotation matrix, to find matrix from the WORLD to BLENDER coordinate system:
    R_world2bcam = np.transpose(matR)
    
    #The matrix describing the transformation from BLENDER to CV/STANDARD coordinates is:
    R_bcam2cv = np.array([[1, 0, 0],
                   [0, -1, 0],
                   [0, 0, -1]])
    
    #Thus the representation from WORLD to CV/STANDARD coordinates is:
    R_world2cv = R_bcam2cv.dot(R_world2bcam)
    
    #The camera coordinate vector requires a similar transformation moving from BLENDER to WORLD coordinates:
    T_world2bcam = -1 * R_world2bcam.dot(vectC)
    T_world2cv = R_bcam2cv.dot(T_world2bcam)

    The resulting R_world2cv and T_world2cv matrices are written to the camera information file using exactly the same format as that of BlendedMVS developed by Dr. Yao. The original rotation and translation information can be found by following the process in reverse. Note that additional steps were required to convert from Blender's unique coordinate system to that of OpenCV; this ensures universal compatibility in the way that the camera intrinsic and extrinsic information is provided.

    Equivalent GPS information is provided (gps.csv), whereby the local coordinate frame is transformed into equivalent GPS information, centered around the Engineering 4.0 campus, University of Pretoria, South Africa. This information is embedded within the JPG files as EXIF data.

  19. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem - the dataset [Dataset]. http://doi.org/10.5281/zenodo.1297925
Organization logo

Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem - the dataset

Explore at:
text/x-python, zip, bin, application/gzipAvailable download formats
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):
- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few month of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/init.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speedup
the process:

####Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

####Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15..30 minutes.

- create a folder `
Search
Clear search
Close search
Google apps
Main menu