Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains a fiber R object, stored as an RDS file for use in an R package for the analysis of long-read sequencing.
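A minimal sketch of loading such an object in R, assuming an illustrative file name fiber.rds:

# The file name is illustrative; readRDS restores the serialized fiber object.
fiber <- readRDS("fiber.rds")

# Inspect the object's class and structure before passing it to analysis functions.
class(fiber)
str(fiber, max.level = 1)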
The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis. For dataset 1: Organizations were selected from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020). For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See the "README" file for further details. References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/. This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
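A minimal sketch of reading a cleaned edge array like dataset 2 into a network object with igraph; the file name edges.csv and its column layout are assumptions based on the description above:

library(igraph)

# Edge array laid out as (node1, node2, edge attribute), per the description above.
edges <- read.csv("edges.csv")

# The first two columns become the edge list; any remaining columns become edge attributes.
g <- graph_from_data_frame(edges, directed = TRUE)
summary(g)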
https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimization to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing, producing a highly accurate consensus sequence from each template. Handling of the large datasets produced by SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on the PacBio sequence data that is summarized in Analysis Flowchart.pdf and that served as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline, or the CCS read files from the datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer); the pipeline uses these to further demultiplex the reads and to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz), as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Identifying_Recombinant_Reads, and Figures. Each contains an .Rmd file of the same name, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample passing various criteria. These are used to help create Table 2 and as input for Identifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, the different sequence collections were saved and viewed in the Geneious program.
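A minimal sketch of the matched-sequence comparison step, shown here in R rather than the compare_seqs.py script actually used; the file names are illustrative and the Bioconductor Biostrings package is assumed:

library(Biostrings)

# Read the rank 1 dUMI consensus sequences and the matching sUMI sequences.
dumi <- readDNAStringSet("rank1_dUMI.fasta")
sumi <- readDNAStringSet("matching_sUMI.fasta")

# Pair records by name and report templates whose sUMI and dUMI consensus differ.
shared <- intersect(names(dumi), names(sumi))
discordant <- shared[as.character(dumi[shared]) != as.character(sumi[shared])]
length(discordant)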
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family, as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file for each sample was extracted from all of the tagged.tar.gz files, then combined to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.
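A minimal sketch of that combining step, assuming the per-sample dUMI_ranked.csv files have already been extracted into an illustrative tagged/ directory:

# Locate every per-sample ranking file (directory layout is illustrative).
files <- list.files("tagged", pattern = "dUMI_ranked\\.csv$",
                    recursive = TRUE, full.names = TRUE)

# Read each file, tag rows with their sample of origin, and stack into one table.
all_umis <- do.call(rbind, lapply(files, function(f) {
  df <- read.csv(f)
  df$sample <- basename(dirname(f))
  df
}))

write.csv(all_umis, "dUMI_df.csv", row.names = FALSE)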
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures, along with individual datasets for each figure. These were copied into the Prism software to create the final figures for the paper.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6C3JR1
This dataverse contains the data referenced in Rieth et al. (2017). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. To be presented at Applied Human Factors and Ergonomics 2017.
Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files.
Each dataframe contains 55 columns:
Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e. normal operating conditions).
Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate training and testing datasets were non-overlapping).
Column 3 ('sample') ranges either from 1 to 500 (“Training” datasets) or 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.
Columns 4 to 55 contain the process variables; the column names retain the original variable names.
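A minimal sketch of loading one of these files and extracting a single simulation run; the file name is illustrative:

# 'load' restores the dataframe under its saved name, e.g. 'faulty_training'.
load("faulty_training.RData")

# Fault type 1, simulation run 1: 500 samples of 52 process variables plus 3 index columns.
run1 <- subset(faulty_training, faultNumber == 1 & simulationRun == 1)
dim(run1)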
This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, Department of Defense, or US Government.
By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms.
The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission.
In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights.
Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law.
When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work.
This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A csv file containing wetland plant flowering phenology data, pooled from observations in 2019 and 2020 at wetlands in Central New York, arranged in a numeric week format that can be read by R to produce a visual graphic using the 'geom_line' function in the ggplot2 package.
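A minimal sketch of such a graphic, assuming hypothetical column names week, species, and n_flowering in an illustrative phenology.csv:

library(ggplot2)

# Column and file names are assumptions; the actual csv layout may differ.
pheno <- read.csv("phenology.csv")

ggplot(pheno, aes(x = week, y = n_flowering, colour = species)) +
  geom_line() +
  labs(x = "Week (numeric)", y = "Flowering observations")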
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the survey of income and program participation (sipp) with r

if the census bureau's budget was gutted and only one complex sample survey survived, pray it's the survey of income and program participation (sipp). it's giant. it's rich with variables. it's monthly. it follows households over three, four, now five year panels. the congressional budget office uses it for their health insurance simulation. analysts read that sipp has person-month files, get scurred, and retreat to inferior options. the american community survey may be the mount everest of survey data, but sipp is most certainly the amazon. questions swing wild and free through the jungle canopy i mean core data dictionary. legend has it that there are still species of topical module variables that scientists like you have yet to analyze. ponce de león would've loved it here. ponce. what a name. what a guy.

the sipp 2008 panel data started from a sample of 105,663 individuals in 42,030 households. once the sample gets drawn, the census bureau surveys one-fourth of the respondents every four months, over four or five years (panel durations vary). you absolutely must read and understand pdf pages 3, 4, and 5 of this document before starting any analysis (start at the header 'waves and rotation groups'). if you don't comprehend what's going on, try their survey design tutorial. since sipp collects information from respondents regarding every month over the duration of the panel, you'll need to be hyper-aware of whether you want your results to be point-in-time, annualized, or specific to some other period. the analysis scripts below provide examples of each. at every four-month interview point, every respondent answers every core question for the previous four months. after that, wave-specific addenda (called topical modules) get asked, but generally only regarding a single prior month. to repeat: core wave files contain four records per person, topical modules contain one. if you stacked every core wave, you would have one record per person per month for the duration of the panel. mmmassive. ~100,000 respondents x 12 months x ~4 years. have an analysis plan before you start writing code so you extract exactly what you need, nothing more. better yet, modify something of mine. cool? this new github repository contains eight, you read me, eight scripts:

1996 panel - download and create database.R
2001 panel - download and create database.R
2004 panel - download and create database.R
2008 panel - download and create database.R
- since some variables are character strings in one file and integers in another, initiate an r function to harmonize variable class inconsistencies in the sas importation scripts
- properly handle the parentheses seen in a few of the sas importation scripts
- because the SAScii package currently does not create an rsqlite database, initiate a variant of the read.SAScii function that imports ascii data directly into a sql database (.db)
- download each microdata file - weights, topical modules, everything - then read 'em into sql

2008 panel - full year analysis examples.R
- define which waves and specific variables to pull into ram, based on the year chosen
- loop through each of twelve months, constructing a single-year temporary table inside the database
- read that twelve-month file into working memory, then save it for faster loading later if you like
- read the main and replicate weights columns into working memory too, merge everything
- construct a few annualized and demographic columns using all twelve months' worth of information
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half, again save it for faster loading later, only if you're so inclined
- reproduce census-published statistics, not precisely (due to topcoding described here on pdf page 19)

2008 panel - point-in-time analysis examples.R
- define which wave(s) and specific variables to pull into ram, based on the calendar month chosen
- read that interview point (srefmon)- or calendar month (rhcalmn)-based file into working memory
- read the topical module and replicate weights files into working memory too, merge it like you mean it
- construct a few new, exciting variables using both core and topical module questions
- construct a replicate-weighted complex sample design with a fay's adjustment factor of one-half
- reproduce census-published statistics, not exactly cuz the authors of this brief used the generalized variance formula (gvf) to calculate the margin of error - see pdf page 4 for more detail - the friendly statisticians at census recommend using the replicate weights whenever possible. oh hayy, now it is.

2008 panel - median value of household assets.R
- define which wave(s) and specific variables to pull into ram, based on the topical module chosen
- read the topical module and replicate weights files into working memory too, merge once again
- construct a replicate-weighted complex sample design with a...
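a rough sketch of the database-loading idea behind those download scripts - pulling a fixed-width sipp ascii file straight into an rsqlite database with the SAScii package. the file names here are illustrative stand-ins, not the repository's actual code:

library(SAScii)   # parses sas input statements to read fixed-width ascii files
library(DBI)
library(RSQLite)

# illustrative names - swap in the real sipp ascii file and its sas read-in script
db <- dbConnect(SQLite(), "sipp08.db")
w1 <- read.SAScii("l08puw1.dat", sas_ri = "p08putm1.sas")
dbWriteTable(db, "w1", w1, overwrite = TRUE)
dbDisconnect(db)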
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Brazil Imports: NCM: fob: Discs F/Laser Read.Syst.Which May Be Record.Once(Cd-R) data was reported at 1.129 USD mn in Jan 2007. This records a decrease from the previous number of 7.819 USD mn for Dec 2006. Brazil Imports: NCM: fob: Discs F/Laser Read.Syst.Which May Be Record.Once(Cd-R) data is updated monthly, averaging 1.129 USD mn from Jan 2004 (Median) to Jan 2007, with 37 observations. The data reached an all-time high of 8.012 USD mn in Oct 2006 and a record low of 0.102 USD mn in Jan 2004. Brazil Imports: NCM: fob: Discs F/Laser Read.Syst.Which May Be Record.Once(Cd-R) data remains active status in CEIC and is reported by Special Secretariat for Foreign Trade and International Affairs. The data is categorized under Brazil Premium Database’s Foreign Trade – Table BR.NCM: HS85: Electrical Machinery and Equipment and Parts Thereof; Others: Imports: Value.
This dataset was created by SuperAI_2021
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
R script for reading hdf5 files
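A minimal sketch of one common approach, using the Bioconductor rhdf5 package; the file and dataset names are illustrative:

library(rhdf5)

# List the groups and datasets in the file, then read one dataset by its path.
h5ls("data.h5")
x <- h5read("data.h5", "group1/dataset1")
str(x)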
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
R scripts to accompany the publication listed below. The methodology and results demonstrate for the first time the feasibility of coalescence and near-perfect simulation for fairly general continuous distributions. This constitutes a quantum leap in MCMC convergence testing by decoupling the convergence error from experimental error and providing, for a given set of starting points, an exact number of iterations in which coalescence occurs.
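A toy sketch of the coalescence idea behind such convergence testing: coupled chains started from different points share randomness until they merge. This two-state example is illustrative only, not the paper's method for continuous distributions:

# Two-state chain: run coupled copies from every starting point with shared
# randomness, and count the exact number of iterations until they coalesce.
set.seed(1)
P <- matrix(c(0.7, 0.3, 0.4, 0.6), nrow = 2, byrow = TRUE)
step <- function(s, u) if (u < P[s, 1]) 1L else 2L

states <- c(1L, 2L)   # the set of starting points
iters <- 0L
while (length(unique(states)) > 1) {
  u <- runif(1)                      # shared random number couples the chains
  states <- vapply(states, step, integer(1), u = u)
  iters <- iters + 1L
}
c(coalesced_state = states[1], iterations = iters)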
This is Census 2020 block data specifically formatted for use by the Environmental Protection Agency (EPA) in-development Environmental Justice Analysis Multisite (EJAM) tool, which uses R code to find which block centroids are within X miles of each specified point (e.g., regulated facility), and to find those distances. The datasets have the latitude and longitude of each block's internal point, as provided by the Census Bureau, and the FIPS code of the block and its parent block group. The datasets also include a weight for each block, representing the block's Census 2020 population count as a fraction of the count for the parent block group overall, for use in estimating how much of a given block group is within X miles of a specified point or inside a polygon of interest. The datasets also have an effective radius for each block: the radius in miles the block would have if it covered the same area but were circular. Finally, the datasets have coordinates in units that facilitate building a quadtree index of locations. They are in R data.table format, saved as .rda or .arrow files to be read by R code.
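A minimal sketch of the centroid-distance test described above, assuming a hypothetical blocks.rda that restores a data.table named blocks with lat and lon columns:

library(data.table)

load("blocks.rda")  # illustrative; restores a data.table named 'blocks'

# Great-circle distance in miles from one point to every block internal point.
haversine_mi <- function(lat1, lon1, lat2, lon2, r = 3959) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# Blocks whose internal points fall within 3 miles of a facility at (40.7, -74.0).
near <- blocks[haversine_mi(40.7, -74.0, lat, lon) <= 3]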
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These files are the migrate input, created using the R files to read in vcf data and write it out in a migrate-n useable format.
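A minimal sketch of the reading half of that conversion, using the vcfR package; the file name is illustrative and the migrate-n writer is only stubbed:

library(vcfR)

# Read the variant calls and extract the genotype matrix (variants x samples).
vcf <- read.vcfR("input.vcf")
gt <- extract.gt(vcf)

# From here, genotypes would be recoded per population and written out in
# migrate-n's expected text layout (writer omitted).
dim(gt)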
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of tree (.tre) files and R code for running generalized Robinson-Foulds distance (Smith, 2020a,b) analysis.
The .tre files can be read into R (R Core Team, 2023) using the ape::read.tree function (Paradis et al., 2004); full details are in the R code file.
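A minimal sketch of that workflow, with illustrative file names:

library(ape)       # read.tree (Paradis et al., 2004)
library(TreeDist)  # generalized Robinson-Foulds distances (Smith, 2020b)

# Read two trees and compute an information-theoretic generalized RF distance.
tree1 <- read.tree("tree1.tre")
tree2 <- read.tree("tree2.tre")
TreeDistance(tree1, tree2)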
Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2), 289-290.
R Core Team. (2023). R: A Language and Environment for Statistical Computing. (Version 4.2.2). R Foundation for Statistical Computing, Vienna, Austria: https://www.R-project.org/.
Smith, M. R. (2020a). Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees. Bioinformatics, 36(20), 5007-5013. https://doi.org/10.1093/bioinformatics/btaa614
Smith, M. R. (2020b). TreeDist: distances between phylogenetic trees. R package version 2.7.0. https://doi.org/10.5281/zenodo.3528124
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description file for the provided code to re-run the statistical analyses
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the consumer expenditure survey (ce) with r

the consumer expenditure survey (ce) is the primo data source to understand how americans spend money. participating households keep a running diary about every little purchase over the year. those diaries are then summed up into precise expenditure categories. how else are you gonna know that the average american household spent $34 (±2) on bacon, $826 (±17) on cellular phones, and $13 (±2) on digital e-readers in 2011? an integral component of the market basket calculation in the consumer price index, this survey recently became available as public-use microdata and they're slowly releasing historical files back to 1996. hooray! for a taste of what's possible with ce data, look at the quick tables listed on their main page - these tables contain approximately a bazillion different expenditure categories broken down by demographic groups. guess what? i just learned that americans living in households with $5,000 to $9,999 of annual income spent an average of $283 (±90) on pets, toys, hobbies, and playground equipment (pdf page 3). you can often get close to your statistic of interest from these web tables. but say you wanted to look at domestic pet expenditure among only households with children between 12 and 17 years old. another one of the thirteen web tables - the consumer unit composition table - shows a few different breakouts of households with kids, but none matching that exact population of interest. the bureau of labor statistics (bls) (the survey's designers) and the census bureau (the survey's administrators) have provided plenty of the major statistics and breakouts for you, but they're not psychic. if you want to comb through this data for specific expenditure categories broken out by a you-defined segment of the united states' population, then let a little r into your life. fun starts now. fair warning: only analyze the consumer expenditure survey if you are nerd to the core. the microdata ship with two different survey types (interview and diary), each containing five or six quarterly table formats that need to be stacked, merged, and manipulated prior to a methodologically-correct analysis. the scripts in this repository contain examples to prepare 'em all, just be advised that magnificent data like this will never be no-assembly-required. the folks at bls have posted an excellent summary of what's available - read it before anything else. after that, read the getting started guide. don't skim. a few of the descriptions below refer to sas programs provided by the bureau of labor statistics. you'll find these in the C:\My Directory\CES\2011\docs directory after you run the download program.
this new github repository contains three scripts:

2010-2011 - download all microdata.R
- loop through every year and download every file hosted on the bls's ce ftp site
- import each of the comma-separated value files into r with read.csv
- depending on user-settings, save each table as an r data file (.rda) or stata-readable file (.dta)

2011 fmly intrvw - analysis examples.R
- load the r data files (.rda) necessary to create the 'fmly' table shown in the ce macros program documentation.doc file
- construct that 'fmly' table, using five quarters of interviews (q1 2011 thru q1 2012)
- initiate a replicate-weighted survey design object
- perform some lovely li'l analysis examples
- replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using unimputed variables
- replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using unimputed variables
- create an rsqlite database (to minimize ram usage) containing the five imputed variable files, after identifying which variables were imputed based on pdf page 3 of the user's guide to income imputation
- initiate a replicate-weighted, database-backed, multiply-imputed survey design object
- perform a few additional analyses that highlight the modified syntax required for multiply-imputed survey designs
- replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using imputed variables
- replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using imputed variables
- replicate the %proc_reg() and %proc_logistic() macros found in "ce macros.sas" and provide some examples of regressions and logistic regressions using both unimputed and imputed variables

replicate integrated mean and se.R
- match each step in the bls-provided sas program "integrated mean and se.sas" but with r instead of sas
- create an rsqlite database when the expenditure table gets too large for older computers to handle in ram
- export a table "2011 integrated mean and se.csv" that exactly matches the contents of the sas-produced "2011 integrated mean and se.lst" text file

click here to view these three scripts for...
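a rough sketch of the replicate-weighted design step those analysis scripts initiate, using the survey package. the weight names (finlwt21, wtrep01-wtrep44) follow ce's public-use layout, but treat this as an illustration rather than the repository's actual code:

library(survey)

# assume 'fmly' is the stacked data.frame of quarterly interview files
ce.design <-
  svrepdesign(
    weights = ~finlwt21,            # main weight
    repweights = "wtrep[0-9]+",     # matches the 44 replicate weight columns
    data = fmly,
    type = "BRR",
    combined.weights = TRUE
  )

# weighted mean of a (hypothetical) total expenditure column
svymean(~totexp, ce.design)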
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains the source code and data required for reproducing the analyses and figures from the manuscript "A lifetime of experiences: Modeling habitat quality through cumulative effects on individual survival", accepted for publication by the journal Methods in Ecology and Evolution on 2025-04-07. Code and data are included to:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An Open Context "predicates" dataset item. Open Context publishes structured data as granular, URL identified Web resources. This "Variables" record is part of the "Pyla-Koutsopetria Archaeological Project I: Pedestrian Survey" data publication.
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
Receptor impact models (RIMs) use inputs from surface water and groundwater models. For a given node, there is a value for each combination of hydrological response variable, future, and replicate or run number. RIMs are developed for specific landscape classes. The hydrological response variables that a RIM within a landscape class requires are organised by the R script RIM_Prediction_CreateArray.R into an array. The formatted data are available in RDS, an R data file format that can be read directly into R. The R script IMIA_XXX_RIM_predictions.R applies the receptor model functions (RDS object, part of Data set 1: Ecological expert elicitation and receptor impact models for the XXX subregion) to the HRV array for each landscape class (or landscape group) to make predictions of receptor impact variables (RIVs). Predictions of a receptor impact from a RIM for a landscape class are summarised at relevant AUIDs by the 5th through to the 95th percentiles (in 5% increments) for baseline and CRDP futures. These are available in the XXX_RIV_quantiles_IMIA.csv data set. RIV predictions are further summarised and compared as boxplots (using the R script boxplotsbyfutureperiod.R) and as (aggregated) spatial risk maps using GIS.
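A minimal sketch of the RDS-based predict-then-summarise pattern described above; the object and file names are illustrative stand-ins for the actual scripts:

# Illustrative stand-ins for the RDS objects the scripts read and produce.
rim_fun <- readRDS("receptor_model_functions.rds")  # maps an HRV value to a RIV prediction
hrv <- readRDS("hrv_array.rds")                     # matrix: replicates (rows) x AUIDs (columns)

# Predict a RIV for every replicate at every AUID, then summarise each AUID
# by the 5th through 95th percentiles in 5% increments.
riv <- apply(hrv, c(1, 2), rim_fun)
riv_quantiles <- apply(riv, 2, quantile, probs = seq(0.05, 0.95, by = 0.05))
write.csv(riv_quantiles, "RIV_quantiles_illustrative.csv")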
Bioregional Assessment Programme (XXXX) GLO Ecological expert elicitation and receptor impact models v01. Bioregional Assessment Derived Dataset. Viewed 12 July 2018, http://data.bioregionalassessments.gov.au/dataset/76fb9d24-b8db-4251-b944-f69f983507ff.
Derived From Groundwater Dependent Ecosystems supplied by the NSW Office of Water on 13/05/2014
Derived From Greater Hunter Native Vegetation Mapping with Classification for Mapping
Derived From BA ALL mean annual flow for NSW - Choudhury implementation of Budyko runoff v01
Derived From Bioregional Assessment areas v06
Derived From Bioregional Assessment areas v04
Derived From Bioregional Assessment areas v02
Derived From Gippsland Project boundary
Derived From Natural Resource Management (NRM) Regions 2010
Derived From GLO subregion boundaries for Impact and Risk Analysis 20160712 v01
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From Bioregional_Assessment_Programme_Catchment Scale Land Use of Australia - 2014
Derived From GEODATA TOPO 250K Series 3
Derived From Australian Geological Provinces, v02
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Geological Provinces - Full Extent
Derived From GLO Preliminary Assessment Extent
Derived From Australian Coal Basins
Derived From Gloucester River Types v01
Derived From Bioregional Assessment areas v03
Derived From Bioregional Assessment areas v05
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From GLO Landscape Classification 20161219 v05
Derived From Gloucester river types V02
Derived From Gloucester Coal Basin
Derived From Greater Hunter Native Vegetation Mapping
Derived From Mean Annual Climate Data of Australia 1981 to 2012
Derived From Subcatchment boundaries within and nearby the Gloucester subregion
Derived From Bioregional Assessment areas v01
Derived From Geofabric Hydrology Reporting Catchments - V2.1
Derived From Victoria - Seamless Geology 2014